Web Mining

Report By,

Faten Al Zahrani Abeer Al Nasser

1-Introduction

With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This covers the automatic search of information resources available online, i.e., Web content mining, and the discovery of user access patterns from Web servers, i.e., Web usage mining.

Web Mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. There are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining and resource discovery based on concept indexing or agent-based technology may also fall in this category. Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents in the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs.
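To make web usage (log) mining concrete, the following is a minimal sketch in Python. It assumes the access log uses the Common Log Format; the log lines, the client addresses and the function names are made up for illustration and do not come from any real server. It counts how often each page is requested and how often a client moves from one page to the next.

```python
import re
from collections import Counter

# Hypothetical access-log lines in Common Log Format (made-up data).
SAMPLE_LOG = """\
10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024
10.0.0.1 - - [01/Jan/2024:10:00:05 +0000] "GET /products.html HTTP/1.1" 200 2048
10.0.0.2 - - [01/Jan/2024:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 1024
10.0.0.2 - - [01/Jan/2024:10:01:30 +0000] "GET /products.html HTTP/1.1" 200 2048
10.0.0.2 - - [01/Jan/2024:10:02:00 +0000] "GET /contact.html HTTP/1.1" 200 512
"""

# client, identity, user, [timestamp], "METHOD path protocol", status, size
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+$'
)


def parse_log(text):
    """Return (client, path) pairs for requests with a 2xx status."""
    visits = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(3).startswith("2"):
            visits.append((match.group(1), match.group(2)))
    return visits


def page_counts(visits):
    """Count how many times each page was requested."""
    return Counter(path for _, path in visits)


def transition_counts(visits):
    """Count page-to-page moves made by the same client, in log order."""
    last_page = {}
    transitions = Counter()
    for client, path in visits:
        if client in last_page:
            transitions[(last_page[client], path)] += 1
        last_page[client] = path
    return transitions


visits = parse_log(SAMPLE_LOG)
hits = page_counts(visits)
moves = transition_counts(visits)
```

Here hits gives the most requested pages and moves gives simple navigation patterns. A real system would also have to split each client's requests into sessions, which this sketch leaves out.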

1.1 Characteristics of web data.

There are many characteristics of web data:

  • The data on the Web is huge in amount. It is currently hard to estimate the exact data volume available on the Internet because Web data grows exponentially every day. For instance, one of the first Web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 Web pages and Web-accessible documents in 1994.

In November 1997, the top search engines claimed to index from 2 million to 100 million Web documents. The large volume of data on the Web makes it difficult to handle Web data with traditional database techniques.

  • The web data is distributed and heterogeneous. Because the Web is essentially an interconnection of various nodes over the Internet, Web data is usually distributed across a wide range of computers or servers located at different places around the world. At the same time, the Web includes not only textual content but also multimedia content such as images, audio files and video. Techniques for Web data processing must therefore be able to deal with the heterogeneity of multimedia data.
  • The data on the Web is unstructured. There are, so far, no rigid and uniform data structures or schemas that Web pages must strictly follow. Instead, Web designers are able to organize related information on the Web in their own ways, for example in HTML format. Web pages in well-defined HTML format may contain some preliminary data structures, e.g. tags or anchors, but these do not amount to a uniform schema.

As a result, there is an increasing need to deal better with the unstructured nature of Web documents and to extract the mutual relationships hidden in Web data, so that users can locate the Web information or services they need.

  • The data on the Web is dynamic. The implicit and explicit structure of Web data is updated frequently. In particular, because of the variety of Web-based data management applications, different presentations of Web documents are generated whenever the content residing in the underlying databases is updated.

1.2 Web community.

A web community or virtual community is a social network of individuals who interact through specific media, potentially crossing geographical and political boundaries in order to pursue mutual interests or goals. One of the most pervasive types of virtual community is the social networking service, which consists of various online communities.

The term virtual community (or web community) is attributed to the book The Virtual Community by Howard Rheingold, published in 1993. The book, which could be considered a social enquiry that places the research in the social sciences, discussed his adventures on The WELL and onward into a range of computer-mediated communication and social groups, broadening it to information science. The technologies included Usenet, MUDs (Multi-User Dungeons) and their derivatives MUSHes and MOOs, Internet Relay Chat (IRC), chat rooms and electronic mailing lists; the World Wide Web as we know it today was not yet used by many people. Rheingold pointed out the potential benefits for personal psychological well-being, as well as for society at large, of belonging to such a group.

These virtual communities all encourage interaction, sometimes focused on a particular interest and sometimes simply for communication. Quality virtual communities do both. They allow users to interact over a shared passion, whether through message boards, chat rooms, social networking sites, or virtual worlds.

A web community is a web site (or group of web sites) that is a virtual community. A web community may take the form of a social network service, an Internet forum, a group of blogs, or another kind of social software web application.

2-What is web mining?

Web data mining is a technique used to crawl through various web resources to collect required information, which enables an individual or a company to promote business and to understand marketing dynamics, new promotions floating on the Internet, etc. There is a growing trend among companies, organizations and individuals alike to gather information through web data mining and to use that information in their best interest.

Data mining is done with various types of data mining software. These range from simple tools to highly specialized ones for detailed and extensive tasks that sift through more information to pick out finer details. For example, if a company is looking for information on doctors, including their emails, fax, telephone, location, etc., this information can be mined with one of these data mining programs. Collecting information through data mining has allowed companies to increase their revenues by making better use of the Internet to gain business intelligence that helps them make vital business decisions.

Before data mining software came into being, businesses used to collect information from recorded data sources. But the bulk of this information is too large, and gathering it by going through all the records is too daunting and time consuming. Therefore the approach of computer-based data mining came into being, and it has gained such popularity that it has become a necessity for the survival of most businesses.

The collected information is used to gain more knowledge and, based on the findings and analysis, to predict the best choice and the right approach on a particular issue. Web data mining is not only focused on gaining business information. It is also used by various organizational departments to make the right predictions and decisions on matters such as business development, work flow and production processes, by going through the business models derived from the data mining.

A strategic analysis department can mine its client archives with data mining software to determine which offers to send to which clients for maximum conversion rates. For example, suppose a company is thinking about launching cotton shirts as a new product. Through its client database, it can determine how many clients have placed orders for cotton shirts over the last year and how much revenue such orders have brought to the company.
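The cotton shirt example can be sketched in a few lines of Python. The order records, client identifiers and the function name summarize_product below are invented for illustration only; a real client database would be queried instead of a hard-coded list.

```python
# Made-up order records: (client_id, product, amount).
ORDERS = [
    ("c1", "cotton shirt", 40.0),
    ("c2", "wool sweater", 90.0),
    ("c1", "cotton shirt", 20.0),
    ("c3", "cotton shirt", 35.0),
    ("c4", "silk scarf", 25.0),
]


def summarize_product(orders, product):
    """Return (set of clients who bought the product, total revenue from it)."""
    buyers = set()
    revenue = 0.0
    for client_id, item, amount in orders:
        if item == product:
            buyers.add(client_id)
            revenue += amount
    return buyers, revenue


buyers, revenue = summarize_product(ORDERS, "cotton shirt")
# Clients who never ordered the product can be targeted with a different offer.
non_buyers = {client_id for client_id, _, _ in ORDERS} - buyers
```

With this toy data, buyers holds the two clients who ordered cotton shirts and non_buyers holds the remaining two, which is the split the company needs in order to tailor its offers.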

With such an analysis in hand, the company can decide which offers to send both to the clients who have ordered cotton shirts and to those who have not. This ensures that the organization heads in the right direction in its marketing and does not go through a trial and error phase, learning the hard facts by spending money needlessly. These analytical facts also shed light on the percentage of customers who may move from the company to a competitor.

Data mining also enables companies to keep a record of fraudulent payments, which can then be researched and studied. This information can help to develop more advanced and protective methods to prevent such events from happening. Buying trends shown through web data mining can help to forecast inventories as well. This is a direct analysis, which enables the organization to fill its stocks appropriately for each month according to the predictions derived from the buying trends.

Data mining technology is going through a huge evolution, and new and better techniques are made available all the time to gather whatever information is required. Web data mining technology is opening avenues not just for gathering data but is also raising many concerns related to data security. There is a great deal of personal information available on the Internet, and web data mining has helped to keep the need to secure that information at the forefront.

3-Data Mining vs. Web mining

Data mining refers to extracting informative knowledge from a large amount of data, which could be expressed in different data types, such as transaction data in e-commerce applications or genetic expressions. No matter which type of data it is, the main purpose of data mining is to discover hidden knowledge, normally in the form of patterns, from the available data repository.

What is the difference between data mining and web mining? One of the significant factors is the structure of the data being mined. Common data mining applications discover patterns in structured data such as a database (i.e. a DBMS). Web mining, by contrast, discovers patterns in less structured data such as the Internet (WWW). In other words, we can say that Web Mining is Data Mining techniques applied to the WWW.

4-Types of web mining

Basically, web mining is of three types:

1. Web Usage mining process

In the web usage mining process, data mining techniques are applied to discover the trends and patterns in the browsing behavior of the visitors of a website. Navigation patterns are extracted, since browsing patterns can be traced, and the structure of the website can be designed accordingly. For example, if a particular feature of the website is used frequently by visitors, you should enhance it and make it more prominent so as to increase its usage and its appeal to the users of the website. This kind of mining makes use of the access logs of the web. Simply by understanding the movement of visitors and their browsing behavior, you can meet their preferences and needs better and popularize your website on the Internet.

2. Web Content Mining

This kind of mining process attempts to discover all the hyperlinks in a document so as to generate a structural report on a web page. It gathers information on different facets, for instance whether users are in a position to find the information, whether the structure of the website is too shallow or too deep, whether the elements of the web page are correctly placed, which areas of the website are the least and the most visited, and whether this has something to do with page design. Such things are analyzed and evaluated for deeper research.

3. Web Linkage/Structure mining

This involves the use of graph theory for analyzing the connections and node structure of the website. According to the type and nature of the web structure data, it is again divided into two kinds:

  • Extraction of patterns from the hyperlinks on the net: A hyperlink is a structural component that connects a web page to some other location.
  • Mining of the structure of the document: A tree-like structure is used for analyzing and describing the XHTML or HTML tags in the web page.
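Both kinds of structure mining listed above can be illustrated with Python's standard html.parser module. This is only a sketch: the class name StructureMiner, the sample page and its URLs are made up, and a real page with badly nested tags would need more careful handling.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StructureMiner(HTMLParser):
    """Collect hyperlink targets and a depth-annotated outline of the tag tree."""

    def __init__(self):
        super().__init__()
        self.links = []     # href values of <a> tags, in document order
        self.outline = []   # (depth, tag) pairs, in document order
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        self.outline.append((self._depth, tag))
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in VOID_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._depth > 0:
            self._depth -= 1


# Made-up page used only for illustration.
PAGE = """<html><body>
<h1>Shop</h1>
<ul><li><a href="/shirts.html">Shirts</a></li>
<li><a href="http://example.com/help">Help</a></li></ul>
</body></html>"""

miner = StructureMiner()
miner.feed(PAGE)
max_depth = max(depth for depth, _ in miner.outline)
```

The links list is the raw material for hyperlink pattern extraction, while outline and max_depth describe the tag tree of the document, for example to judge whether a page is nested too deeply.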

4.1-Web content mining

Web content mining, also known as text mining, is generally the second step in Web data mining. Content mining is the scanning and mining of text, pictures and graphs of a Web page to determine the relevance of the content to the search query. This scanning is completed after the clustering of web pages through structure mining and provides the results based upon the level of relevance to the suggested query. With the massive amount of information that is available on the World Wide Web, content mining provides the results lists to search engines in order of highest relevance to the keywords in the query.
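Ordering pages by their relevance to the keywords in a query can be sketched as follows. The report does not name a scoring method, so this example assumes a simple TF-IDF score as a stand-in; the page texts, URLs and the function name rank_pages are made up for illustration.

```python
import math
import re
from collections import Counter

# Made-up page texts, keyed by hypothetical URLs.
PAGES = {
    "/shirts": "cotton shirts and linen shirts for summer",
    "/care": "how to wash cotton fabric",
    "/about": "our company history and contact details",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_pages(pages, query):
    """Return (url, score) pairs sorted from most to least relevant."""
    docs = {url: Counter(tokenize(text)) for url, text in pages.items()}
    n_docs = len(docs)
    scores = {}
    for url, counts in docs.items():
        total = sum(counts.values())
        score = 0.0
        for term in tokenize(query):
            # Number of pages that contain the term at least once.
            df = sum(1 for other in docs.values() if term in other)
            if df == 0 or total == 0:
                continue
            tf = counts[term] / total
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
            score += tf * idf
        scores[url] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_pages(PAGES, "cotton shirts")
```

For the query "cotton shirts", the page that mentions both keywords ranks first and the page that mentions neither gets a score of zero, which is how irrelevant results can be pushed to the bottom of a result list.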

Text mining is directed toward the specific information that the customer enters as a search query in a search engine. This allows for the scanning of the entire Web to retrieve the relevant content clusters, which triggers the scanning of specific Web pages within those clusters. The resulting pages are relayed to the search engine in order from the highest level of relevance to the lowest. Although search engines are able to provide links to thousands of Web pages related to the search content, this type of web mining reduces the amount of irrelevant information.

Web text mining is very effective when used with a content database dealing with specific topics. For example, online universities use a library system to recall articles related to their general areas of study. This specific content database makes it possible to pull only the information within those subjects, providing the most specific results for search queries in search engines. Providing only the most relevant information gives a higher quality of results. This increase in productivity is due directly to the use of content mining of text and visuals.

The main uses for this type of data mining are to gather, categorize, organize and provide the best possible information available on the WWW to the user requesting it. This tool is essential for scanning the many HTML documents, images, and text provided on Web pages. The resulting information is provided to the search engines in order of relevance, giving more productive results for each search.

Web content categorization with a content database is the most important tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through thousands of results to find the information most relevant to the query. Text mining reduces these thousands of results. This eliminates frustration and improves the navigation of information on the Web.

Business uses of content mining allow the information provided on a company's site to be structured in a relevance-ordered site map. This allows a customer of the Web site to access specific information without having to search the entire site. With this type of mining, data remains available in order of relevance to the query, thus providing productive marketing.

Used as a marketing tool, this brings additional traffic to the Web pages of a company's site, based on the keyword relevance the pages offer to general searches.

As the second step of web data mining, text mining is useful for improving the productive uses of mining for businesses, Web designers, and search engine operations. Organizing, categorizing, and gathering the information provided by the WWW become easier and produce more productive results through the use of this type of mining.

In short, Web content mining allows search engine results to maximize the flow of customer clicks to a Web site, or to particular Web pages of the site, so that they are accessed numerous times according to their relevance to search queries. The clustering and organization of Web content in a content database enables effective navigation of the pages by the customer and by search engines. Images, content, formats and Web structure are examined to deliver a higher quality of information to the user based upon the requests made. Businesses can make the most of this text mining to improve the marketing of their sites as well as of the products they offer.

4.2-Web linkage mining

Web linkage mining, or Web structure mining, is the organization of the content via HTML and XML tags. Web structure mining, one of the three categories of web mining, is a tool used to identify the relationship between Web pages linked by information or by a direct link connection. This structure data is discoverable by the provision of a web structure schema through database techniques for Web pages. This connection allows a search engine to pull data relating to a search query directly to the linking Web page from the Web site the content rests upon. This is accomplished by spiders that scan the Web sites, retrieve the home page, and then follow reference links to bring forth the specific page containing the desired information.
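The spider behavior described above (start at the home page, then follow reference links until the page with the desired information is found) can be sketched as a breadth-first traversal. To keep the example self-contained, it assumes a small in-memory site instead of real HTTP requests; the pages, links and the function name crawl_for are invented for illustration.

```python
from collections import deque

# Hypothetical site: each page maps to (text, outgoing links). Made-up data.
SITE = {
    "/": ("Welcome to the shop", ["/products", "/about"]),
    "/products": ("Browse our catalogue", ["/products/shirts", "/"]),
    "/about": ("Company history", ["/"]),
    "/products/shirts": ("Cotton shirts on sale", []),
}


def crawl_for(site, start, keyword):
    """Breadth-first crawl from `start`.

    Return the chain of links leading to the first page whose text contains
    `keyword`, or None if no reachable page matches.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        text, links = site[path[-1]]
        if keyword.lower() in text.lower():
            return path
        for link in links:
            # Skip links that leave the site or were already queued.
            if link in site and link not in seen:
                seen.add(link)
                queue.append(path + [link])
    return None


path = crawl_for(SITE, "/", "cotton")
```

The returned path is the chain of reference links from the home page to the matching page. The seen set keeps the spider from looping on pages that link back to each other.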