ARIST Webometrics - Outline

ARIST Webometrics 25/11/2003 17:34

Webometrics

Mike Thelwall, Liwen Vaughan, Lennart Björneborn

1 Introduction

1.1 Webometrics, Bibliometrics and Informetrics

1.2 Scope of Coverage and Related Reviews

2 Basic Concepts and Methods

2.1 Terminology

2.1.1 Basic link terminology

2.1.2 Basic web node terminology and diagrams

2.2 Units of Analysis

2.2.1 The Definition of the Web

2.2.2 Alternative Document Models

2.3 Data Collection Methods

2.3.1 Personal Web Crawlers

2.3.2 Commercial Search Engines

2.3.3 Web Log Files

2.4 Selection and Sampling Methods

2.4.1 Sampling Pages

2.4.2 Sampling Sites

2.4.3 Large Area Web Studies

3 Scholarly Communication on the Web

3.1 Journals and Scholars

3.2 National University Systems

3.3 International University Comparisons

3.4 Departments Within a Discipline

3.5 Link Creation Motivations

3.6 Conclusions

4 General and Commercial Web Use

4.1 Measurements of the Web

4.2 Web Server Log Analysis

4.3 Commercial Web Site Studies

4.4 Social Network Analysis

5 Topological Modeling and Mining of the Web

5.1 Network Models of Linking Behavior

5.2 Overall Web Topology

5.3 Clustering in the Web

5.4 Small Worlds

6 Summary and Concluding Remarks

7 References

1 Introduction

Webometrics, the quantitative study of web-related phenomena, originated in the realization that methods originally designed for bibliometric analysis of scientific journal article citation patterns could be applied to the Web, with commercial search engines providing the raw data. Almind and Ingwersen (1997) defined the discipline and gave it its name, although the basic issue had been identified simultaneously by Rodríguez Gairín (1997) and was pursued in Spain by Aguillo (1998). Larson (1996) was also a pioneer with his early exploratory link structure analysis, as was Rousseau (1997) with the first pure informetric analysis of the Web. We interpret webometrics in a broad sense encompassing research from disciplines outside of Information Science such as Communication Studies, Statistical Physics and Computer Science. In this review we will concentrate on types of link analysis but also cover other webometric areas that Information Scientists have been involved with, including web log file analysis.

One theme that runs through this chapter is the messiness of web data and the need for heuristics to cleanse it. This is a problem even at the most basic level of defining the Web. The uncontrolled Web creates numerous problems in the interpretation of results, for instance from the automatic creation or replication of links and deliberately misleading publishing. The loose connection between the apparent usage of top level domains and their actual content is also a frustrating problem, for example with the extensive non-commercial content hosted in .com sites. Indeed a skeptical researcher could claim that the obstacles of this kind are so great that all web analyses have little value. As will be seen below, one response to this perspective – also a recurrent theme for critics of evaluative bibliometrics – is to demonstrate significant correlation statistics to prove that information is present. A practical response has been to develop increasingly sophisticated data cleansing strategies and multiple data analysis techniques. The immense importance of the Web to scholars and the wider society means that it is essential to build an understanding of it, however difficult.

This review is split into four parts: basic concepts and methods; scholarly communication on the Web; general and commercial web use; and topological modelling and mining of the Web. Because webometrics is a new field based around analysing a new data source, methods of collecting and processing the data have been prominent in many studies; these are the subject of the first part. The second part, scholarly communication on the Web, is predominantly concerned with using link analysis to identify patterns in academic or scholarly web spaces. Almost all of these studies have direct analogies in traditional bibliometrics, and have drawn from this area a concern with developing effective methods and validating results, the latter being an issue of particular concern on the Web. A key question that still does not have a satisfactory answer is how to interpret counts of links to academic web spaces. For example, if one university web site attracts double the links of another, what conclusions should be drawn?

The general and commercial web use section reviews link analysis studies that have used techniques similar to those applied to academic web spaces. Some have origins in Social Network Analysis rather than Information Science, producing an interesting complementary perspective. The section also includes quantitative studies of the size of the ‘whole’ Web and web server log analysis. The final section, topological modelling and mining of the Web, covers mathematical approaches to modelling the growth of the Web or its internal link structure, mostly the product of Computer Science and Statistical Physics research. It culminates with an exciting new information science contribution to this area, providing detailed interpretations of small-world linking phenomena.

This chapter should be useful to all researchers who study either the Web or a phenomenon that has an online component, whether by reporting directly relevant results or by providing a starting point for the development of new techniques.

1.1 Webometrics, Bibliometrics and Informetrics

Being a global document network initially developed for scholarly use (Berners-Lee & Cailliau, 1990) and now inhabited by a diversity of users, the Web constitutes an obvious research area for bibliometrics, scientometrics and informetrics. A range of new terms for the emerging research area have been proposed since the mid-1990s, for instance, netometrics (Bossy, 1995); webometry (Abraham, 1996); internetometrics (Almind & Ingwersen, 1996); webometrics (Almind & Ingwersen, 1997); cybermetrics (journal started 1997 by Isidro Aguillo); web bibliometry (Chakrabarti et al., 2002); web metrics (term used in Computer Science, e.g., Dhyani, Keong & Bhowmick, 2002). Webometrics and cybermetrics are currently the two most widely adopted terms in Information Science, often used as synonyms.

Björneborn & Ingwersen (in press) have proposed a differentiated terminology distinguishing between studies of the Web and studies of all Internet applications. They used an Information Science-related definition of webometrics as “the study of the quantitative aspects of the construction and use of information resources, structures and technologies on the WWW drawing on bibliometric and informetric approaches” (Björneborn & Ingwersen, in press). This definition thus covers quantitative aspects of both the construction side and the usage side of the Web, embracing the four main areas of present webometric research: (1) web page content analysis, (2) web link structure analysis, (3) web usage analysis (e.g., exploiting log files for users’ searching and browsing behavior), and (4) web technology analysis (including search engine performance). This includes hybrid forms, for example, Pirolli et al. (1996), who explored web analysis techniques for automatic categorization utilizing link graph topology, text content and metadata similarity, as well as usage data. All four main research areas include longitudinal studies of changes on the dynamic Web, for example, of page contents, link structures and usage patterns. So-called web archaeology (Björneborn & Ingwersen, 2001) could in this webometric context be important for recovering historical web developments, for instance, by means of the Internet Archive, an approach already used in webometrics (Björneborn, 2003; Vaughan & Thelwall, 2003; Thelwall & Vaughan, 2004).

Furthermore, Björneborn & Ingwersen (in press) have proposed cybermetrics as a generic term for “the study of the quantitative aspects of the construction and use of information resources, structures and technologies on the whole Internet, drawing on bibliometric and informetric approaches”. Cybermetrics thus encompasses statistical studies of discussion groups, mailing lists, and other computer-mediated communication on the Internet (e.g., Bar-Ilan, 1997; Hernández-Borges et al., 1997; Matzat, 1998; Herring, 2002a) including the Web. Besides covering all computer-mediated communication using Internet applications, this definition of cybermetrics also covers quantitative measures of the Internet backbone technology, topology and traffic (cf. Molyneux & Williams, 1999). The breadth of coverage of cybermetrics and webometrics implies large overlaps with proliferating Computer-Science-based approaches in analyses of web contents, link structures, web usage and web technologies. A range of such approaches has emerged since the mid-1990s with names like Cyber Geography / Cyber Cartography (e.g., Girardin, 1996; Dodge, 1999; Dodge & Kitchin, 2001), Web Ecology (e.g., Chi et al., 1998; Huberman, 2001), Web Mining (e.g., Etzioni, 1996; Kosala & Blockeel, 2000; Chen & Chau, 2004), Web Graph Analysis (e.g., Chakrabarti et al., 1999; Kleinberg et al., 1999; Broder et al., 2000), and Web Intelligence (e.g., Yao et al., 2001). The raison d'être for using the term webometrics in this context could be to denote a heritage to bibliometrics and informetrics and stress an Information Science perspective on Web studies.

There are different conceptions of informetrics, bibliometrics and scientometrics. The diagram in Fig. 1.1 (Björneborn & Ingwersen, in press) shows the field of informetrics embracing the overlapping fields of bibliometrics and scientometrics following widely adopted definitions by, e.g., Brookes (1990), Egghe & Rousseau (1990) and Tague-Sutcliffe (1992). According to Tague-Sutcliffe (1992), informetrics is “the study of the quantitative aspects of information in any form, not just records or bibliographies, and in any social group, not just scientists”. Bibliometrics is defined as “the study of the quantitative aspects of the production, dissemination and use of recorded information” and scientometrics as “the study of the quantitative aspects of science as a discipline or economic activity” (Tague-Sutcliffe, 1992). In the figure, politico-economical aspects of scientometrics are covered by the part of the scientometric ellipse lying outside the bibliometric one.

In this context, the field of webometrics may be seen as entirely encompassed by bibliometrics, because web documents, whether text or multimedia, are recorded information (cf. Tague-Sutcliffe’s abovementioned definition of bibliometrics) stored on web servers. This recording may be temporary only, just as not all paper documents are properly archived. In the diagram, webometrics is partially covered by scientometrics, as many scholarly activities today are web-based. Furthermore, webometrics is totally included within the field of cybermetrics as defined above.

In the diagram, the field of cybermetrics exceeds the boundaries of bibliometrics, because some activities in cyberspace normally are not recorded, but communicated synchronously as in chat rooms. Cybermetric studies of such activities still fit in the generic field of informetrics as the study of the quantitative aspects of information “in any form” and “in any social group” as stated above by Tague-Sutcliffe (1992).

Figure 1.1 Infor-, biblio-, sciento-, cyber-, and webo-metrics (Björneborn & Ingwersen, in press). The sizes of the overlapping ellipses are chosen for the sake of clarity only.

The inclusion of webometrics expands the field of bibliometrics, as webometrics will inevitably contribute further methodological developments of web-specific approaches. Just as ideas rooted in bibliometrics, scientometrics and informetrics have contributed to the emergence of webometrics, ideas in webometrics might now contribute to the development of these embracing fields.

1.2 Scope of Coverage and Related Reviews

Although webometrics is still a young research field, there are already many review articles offering partial coverage, including two ARIST chapters. The first is Molyneux and Williams’ (1999) “Measuring the Internet”, but so much has subsequently changed that its coverage is unavoidably out of date. The Bibliometrics chapter of Borgman and Furner (2002) explicitly cast web links as structurally equivalent to journal citations, developing a terminology in which both are referred to as links. Web research from inside and outside of Information Science was covered, and there will be many cognitive connections with our chapter - especially with the scholarly communication section - despite a relatively small overlap in articles reviewed. This is because many issues from standard bibliometrics are mirrored in the Web. A special issue of JASIST on webometrics was set to appear in 2004, to contain a wide variety of approaches to the quantitative study of the Web.

Bar-Ilan and Peritz (2002) have published a review of “Informetric theories and methods for exploring the Internet”, which has a similar scope to ours but includes non-web Internet research, is more focused on general informetric techniques, and does not reflect the recent significant advances in hyperlink analysis in particular. Web metrics are surveyed from a computer science perspective by Dhyani, Keong and Bhowmick (2002), covering mathematical techniques for measuring a wide variety of online phenomena. Rasmussen’s (2003) ARIST chapter on “Indexing and retrieval for the Web” provides useful additional information on web crawlers and search engines. Our review differs from all of the above in its primary focus on different types of Web link analysis.

Several general webometrics articles were published in 2003, alongside a forthcoming encyclopedia article on web hyperlink analysis (Vaughan, in press), reflecting a widening recognition of the importance of the topic and the existence of a critical mass of research. The first compared Information Science approaches to studying the Web to those from Social Network Analysis (Park & Thelwall, 2003), finding that Information Scientists emphasized data validation and the study of methodological issues, whereas Social Network Analysts experimented with transferring existing theory to the Web. A second article introduced web-based quantitative methods to social scientists for use in their web research (Wilkinson, Thelwall & Li, 2003). This article coincided with the release of SocSciBot, a special tool for social science link structure analysis, and included a review section as well as a more prescriptive ‘how to do it’ conclusion. Ingwersen’s (1998) Web Impact Factor calculation has also been singled out for detailed coverage by Li (2003), and a general review has also been published in Portuguese (Peres, 2002).

Other review articles cover closely related topics. Search engine research is covered by Bar-Ilan (2004), search engines themselves by Arasu, Cho, Garcia-Molina et al. (2001), and data collection techniques in general by Bar-Ilan (2001) and Thelwall (2002a). Henzinger (2001) reviewed link structure analysis from a computer science perspective, showing how links could be used in search engine ranking algorithms. Barabási (2002) and Huberman (2001) have written popular science books explaining current research into mathematical modeling of the growth of the Web. Web Mining (Chen & Chau, 2004; Kosala & Blockeel, 2000) and Web Intelligence (Yao et al., 2001) are also relevant.

Online communication studies are useful for interpreting webometric research. Herring (2002a) gave an overview of Computer Mediated Communication on the Internet, taking a social sciences perspective and focusing mostly on non-web media such as email and newsgroups. One of the key general findings was that the use of new technology was very context-specific, determined by users’ needs rather than the technology. Ellis and Oldridge (2004) explore the use of electronic communication in forming electronic communities. Finholt (2002) reviews the mixed fortunes of collaboratories, which are a kind of electronic virtual laboratory for scientists to share equipment or data. Finally, Kling and Callahan (2004) discuss e-journals and scholarly communication, offering useful perspectives on our subsection on e-journals.

Our review not only covers more recent research than any of those discussed above, but also has two areas of special emphasis: link analysis in academic web spaces, and basic concepts and methods. Other areas are covered, but less comprehensively, and either draw from or feed into the two main themes.

2 Basic Concepts and Methods

This section contains much terminology and many technical details. It is intended to provide a coherent basis for future webometric research and background to the studies reported. Readers are advised to skip to the next section, ‘Scholarly Communication on the Web’, on a first reading and return to this one afterwards.

2.1 Terminology

2.1.1 Basic link terminology

The initial exploratory phases of an emerging field like webometrics inevitably lead to variety in the terminology used. For instance, a link received by a web node (the network term ‘node’ here covers a unit of analysis like a web page, directory, web site, or an entire top level domain of a country) has been variously named incoming link, inbound link, inward link, back link, and sitation; the latter term (McKiernan, 1996; Rousseau, 1997) has clear connotations of bibliometric citation analysis. An example of more problematic terminology is the two opposite meanings of external link: either a link pointing out of a web site or a link pointing into a site. We recommend the consistent basic webometric terminology of Björneborn and Ingwersen (in press) for link relations between web nodes, as briefly outlined in Fig. 2.1. The proposed terminology has origins in graph theory, social network analysis and bibliometrics.

B has an inlink from A
B has an outlink to C
B has a selflink
E and F are reciprocally linked
A has a transversal outlink to G: functioning as a shortcut
H is reachable from A by a directed link path
I has neither in- nor outlinks; I is isolated
B and E are co-linking to D; B and E have co-outlinks
C and D are co-linked from B; C and D have co-inlinks

Figure 2.1. Basic webometric link terminology (see Björneborn & Ingwersen (in press) for a more detailed legend). The letters may represent different web node levels, for example, web pages, web directories, web sites, or top level domains of countries or generic sectors.
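The relations listed in the legend of Figure 2.1 can be made concrete with a small directed-graph sketch. This is only an illustration under stated assumptions, not part of the Björneborn and Ingwersen framework: the function names are hypothetical, the link G -> H is assumed (the legend says only that H is reachable from A), and selflinks are excluded when counting inlinks and outlinks.

```python
from collections import deque

# Directed links (source, target) read off the Figure 2.1 legend.
# ASSUMPTION: the G -> H link is invented for illustration; the legend
# only states that H is reachable from A by some directed link path.
LINKS = {
    ("A", "B"), ("B", "C"), ("B", "B"), ("E", "F"), ("F", "E"),
    ("A", "G"), ("G", "H"), ("B", "D"), ("E", "D"),
}
NODES = set("ABCDEFGHI")


def outlinks(node):
    """Targets linked from node (selflinks excluded by choice)."""
    return {t for s, t in LINKS if s == node and t != node}


def inlinks(node):
    """Sources linking to node (selflinks excluded by choice)."""
    return {s for s, t in LINKS if t == node and s != node}


def has_selflink(node):
    return (node, node) in LINKS


def reciprocally_linked(a, b):
    return (a, b) in LINKS and (b, a) in LINKS


def is_isolated(node):
    """A node with neither inlinks nor outlinks."""
    return not inlinks(node) and not outlinks(node)


def co_outlinks(a, b):
    """Targets that a and b both link to (a and b are co-linking)."""
    return outlinks(a) & outlinks(b)


def co_inlinks(a, b):
    """Sources that link to both a and b (a and b are co-linked)."""
    return inlinks(a) & inlinks(b)


def reachable(start, goal):
    """Breadth-first search for a directed link path from start to goal."""
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return True
        for nxt in outlinks(current) - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False
```

For example, `co_outlinks("B", "E")` returns the set containing D, and `is_isolated("I")` is true. The same functions apply whatever node level the letters stand for (pages, directories, sites, or top level domains).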

The terms outlink and inlink are commonly used in Computer Science-based Web studies (e.g., Pirolli et al., 1996; Chen et al., 1998; Broder et al., 2000). The term outlink implies that a directed link and its two adjacent nodes are viewed from the source node providing the link, analogous with the use of the term reference in bibliometrics. A corresponding analogy exists between the terms inlink and citation, with the target node as the spectator’s viewpoint. In the conceptual framework of Björneborn and Ingwersen (in press), a link crossing a web site border, like link e in figure 2.2 below, is called a site outlink or a site inlink depending on the perspective of the spectator.

On the Web, selflinks are used for a wider range of purposes than self-citations in scientific literature. Page selflinks point from one section to another within the same page. Site selflinks (also known as internal links) are typically navigational pointers from one page to another within the same site. Most links on the Web connect web pages containing cognate topics (Davison, 2000). However, some links may break a typical linkage pattern in a web node neighborhood and connect dissimilar topical domains. Such (loosely defined) transversal links (Björneborn, 2001, 2003) function as cross-topic shortcuts and may affect so-called small-world phenomena on the Web (cf. the section below on small worlds).
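The distinction between page selflinks, site selflinks and links crossing a site border can be sketched as a simple classifier over a pair of URLs. This is a rough illustration only: the function name is hypothetical, the example URLs are placeholders, and it assumes that a web site can be identified with a hostname, which is a crude simplification of the unit-of-analysis question.

```python
from urllib.parse import urlsplit


def classify_link(source_url, target_url):
    """Label a link from the perspective of the source page.

    ASSUMPTION: a 'site' is identified with the URL hostname, and a
    'page' with hostname + path + query (the fragment is ignored).
    """
    src, tgt = urlsplit(source_url), urlsplit(target_url)
    if src.hostname != tgt.hostname:
        # Seen from the target site, the same link is a site inlink.
        return "site outlink"
    same_page = (src.path or "/") == (tgt.path or "/") and src.query == tgt.query
    if same_page:
        # Points from one section to another within the same page.
        return "page selflink"
    # Navigational pointer between pages of the same site (internal link).
    return "site selflink"


print(classify_link("http://www.example.ac.uk/a.html",
                    "http://www.example.ac.uk/a.html#results"))
print(classify_link("http://www.example.ac.uk/a.html",
                    "http://www.example.ac.uk/b.html"))
print(classify_link("http://www.example.ac.uk/a.html",
                    "http://www.example.org/"))
```

Whether a link is transversal cannot be decided from URLs alone, since that depends on the topics of the connected pages; the sketch therefore leaves that category out.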