Scatter Matters: Information Scatter and its Implications for Web Searchers

Suresh K. Bhavnani, Frederick A. Peck


University of Michigan

Ann Arbor, MI 48109

Phone: 734-615-8281

Fax: 764-2475

Abstract. Despite the development of extensive websites with high-quality information, and the use of powerful search engines, many users find it difficult to find comprehensive information. One reason for this difficulty is that information, even for narrow, well-defined topics, is highly scattered across websites, with no page or site containing all the relevant information. This high scatter can be explained by the existence of three page profiles that vary in information density, each of which plays a distinct role in the presentation of information. Such information scatter has important implications for searchers and for the design of systems to help find comprehensive information.

The web has spawned the development of extensive websites in domains such as healthcare and e-commerce. For example, the National Cancer Institute’s website contains thousands of pages with information about 118 cancers. Given such large collections of information, one might conclude that it is easy to obtain comprehensive information about a topic like cancer by visiting one such website. However, despite the use of powerful search engines (which aim to locate such extensive sites), many web searchers have difficulty finding comprehensive information about a topic (1). Finding comprehensive information about a topic is critical because an increasing number of people act on information from the web, with real-world outcomes. For example, an estimated half of all American adults have searched online for healthcare information to become informed, to prepare for appointments and surgery, and to share information (2).

Why is it difficult to find comprehensive information about a topic? Our research suggests that the difficulty arises because information on the web, even for narrow, well-defined topics, tends to be scattered across different websites. For example, a recent study (3) showed that while physicians independently agreed that patients need to know 14 facts [1] (e.g., having fair skin increases your risk of getting melanoma) about melanoma risk and prevention, none of the top-10 websites with melanoma information provided all those facts.


An analysis of the distribution of facts across relevant web pages from the above top-10 sites revealed the precise nature of the scatter. As shown in Figure 1, there were a large number of pages that contained very few facts, a few pages that contained many facts, and no page that contained all the facts. Because the above results were replicated across different topics in two domains (3, 4), the scatter of facts across online sources appears to be a general phenomenon. The results also suggest that web searchers might be ending their searches prematurely (5): because they find many pages with few facts (given their relatively large numbers), they may conclude that they have found all relevant facts about their search topic. While such a distribution conforms to distributions observed in other information-related phenomena (e.g., 6-9), the distribution of facts across pages has received little attention.

Why are there so many pages with few facts? After all, the above pages belong to the top-10 websites with melanoma information. Our new findings help to answer this question, and reveal a new dimension of how information is organized on the web. Analysis of the page content revealed that the pages not only differed in the number of facts (as shown in Figure 1), but also in the amount of information about each fact. For example, some pages contained a single fact covering most of the page, while other pages contained many facts in little detail. These page profiles suggest that the number of facts (fact-breadth) and the amount of information about each fact (fact-depth) were important dimensions that could explain the distribution shown in Figure 1.


A cluster analysis (10) of the pages, based on fact-breadth and fact-depth, revealed how the two dimensions interacted to create different page profiles. As shown in Figure 2, the analysis identified three clusters of pages: (1) The top cluster (denoted by “▲”) contains pages with a few facts, one of which is described in high detail. These pages were labeled specific pages for the topic. (2) The lower left-hand cluster (denoted by “+”) contains pages with relatively few facts in low-to-medium detail. These pages were labeled sparse pages for the topic. (3) The right-hand cluster contains pages with many facts in low-to-medium detail. These pages were labeled general pages for the topic. The figure also shows that there is a higher percentage (74%) of specific and sparse pages (which contain relatively few facts) compared to general pages (which contain many facts) [2].

Why are facts provided in different densities across pages? A content analysis (10) revealed that each page type played a distinct role in providing information. As shown in Figures 3A and 3B, general pages (which provide many facts in low-to-medium detail) serve as broad overviews of topics, while specific pages (which provide few facts in high detail) serve as detailed discussions of a particular fact. The general and specific pages therefore follow the classic trade-off between depth and breadth (11, 12) observed in other domains. This trade-off in the number and detail of facts suggests a process through which these page profiles are generated. Web page authors might be following a rich-gets-richer (13) process, progressively adding facts in detail to a page until a length and detail threshold is reached. At such a threshold, authors might create new pages to elaborate particular facts in high detail, while abstracting detail from the general pages to make them shorter and more readable. Such a process would lead to the creation of a large number of specific pages, while constraining the total number of general pages.

However, the above process does not explain the existence of sparse pages. Why would authors create pages that have few facts described in low detail? On the surface such pages appear to be of low relevance, and therefore not useful to the search task. An analysis (10) revealed that sparse pages contained high-quality information about related topics (e.g., skin surgery), while also providing a brief mention of a fact about melanoma risk and prevention. Figure 3C shows an example of a sparse page. Such pages are valuable in a comprehensive search because they show the relationship between the topic being searched, and broader topics. Furthermore, because sparse pages were the most dominant page-type (comprising 53% of the data), they also play a critical role in the skew of the distribution towards fewer facts. The above three page profiles have been replicated across four other topics, and therefore appear to be a general phenomenon.

Similar to the regularities in how pages link to each other (13, 14), the analysis of information scatter provides a new dimension of organization on the web, with important implications for searchers and for the design of search systems. The analysis of information scatter suggests that users looking for comprehensive information should follow a general-specific-sparse search strategy, where they first read a few general pages (to get an overview of all the facts), followed by specific pages (to get detailed information about specific facts), followed by sparse pages (to understand how the topic being searched might be connected to related topics). While such an approach is often recommended by expert searchers (1), a preliminary analysis showed that it was difficult for users to follow the general-specific-sparse strategy within sites just by following links. This is because general, specific, and sparse pages are not systematically linked.

Because neither websites nor current search systems help users follow the general-specific-sparse search strategy, we have developed a prototype system that guides novice searchers through such a strategy. A recent study showed that this approach helps users find more comprehensive information about a topic compared to equivalent searchers who use conventional search methods (1). Our current research involves the development of an information density algorithm that automatically determines the fact-depth and fact-breadth of a page, and uses this information to categorize pages as general, specific, or sparse. This algorithm will be used to automatically generate portals that guide searchers to use the general-specific-sparse search strategy. Regularities in the way that information is scattered on the web have therefore suggested novel approaches that help users find more comprehensive online information.

References and Notes

  1. S. K. Bhavnani et al., in Proceedings of the Conference on Human Factors in Computing Systems 2003 (ACM, New York, NY, 2003), pp. 393-400.
  2. S. Fox, L. Rainie, Pew Internet and American Life Project (2002).
  3. S. K. Bhavnani, J. Am. Soc. Info. Sci. and Tech., in press.
  4. S. K. Bhavnani, Automation in Const., in press.
  5. S. K. Bhavnani, in NIST Special Publication 500-250: The Tenth Text Retrieval Conference, E. M. Voorhees, D. K. Harman, Eds. (NIST, Washington, DC, 2001), pp. 571-578.
  6. M. Bates, in Emerging Frameworks and Methods: Proceedings of the Fourth International Conference on Conceptions of Library and Information Science, H. Bruce, R. Fidel, P. Ingwersen, P. Vakkari, Eds. (Libraries Unlimited, Greenwood Village, CO, 2002), pp. 137-150.
  7. S. C. Bradford, Documentation (Crosby Lockwood, London, 1948).
  8. B. A. Huberman, L. Adamic, Nature 401, 131 (1999).
  9. B. A. Huberman, P. L. Pirolli, J. E. Pitkow, R. M. Lukose, Science 280, 95 (1998).
  10. Materials and methods are available as supporting material on Science Online.
  11. K. Larson, M. Czerwinski, in Proceedings of the Conference on Human Factors in Computing Systems 1998 (ACM, New York, NY, 1998), pp. 25-32.
  12. A. Woodruff, J. Landay, M. Stonebraker, in Proceedings of Advanced Visual Interfaces ’98 (1998), pp. 57-65.
  13. A.-L. Barabási, R. Albert, Science 286, 509 (1999).
  14. J. Kleinberg, S. Lawrence, Science 294, 1849 (2001).

Acknowledgements

The research was funded in part by NSF Award #EIA-9812607. We thank T. Finholt, R. Little, A. Rao, R. Thomas, F. Reif, and G. Vallabha for their contributions.

One Line Summary

The high scatter of online information can be explained by the existence of three distinct page profiles, which have implications for searchers and for the design of systems to help find comprehensive information.


Supporting online material

Identification of page profiles

Method. To identify the page profiles, we used cluster analysis to automatically cluster the webpages [3] according to the depth and breadth of fact coverage. Both measures were derived from our previous study (1), in which two raters were asked to independently rate the amount of information about each fact on each page using a 5-point scale: 0 = fact not covered on page; 1 = fact covered in less than one paragraph; 2 = fact covered in one paragraph; 3 = fact covered in more than one paragraph; 4 = webpage mostly devoted to the fact, although other facts could also be covered on the same page [4]. The raters had high agreement on whether or not a fact was present on a page, and on the extent to which the fact was covered on that page (1). For the cluster analysis in the current study, the fact-breadth of a page was defined as the total number of facts for a topic that occurred on that page. The fact-depth of a page was defined as the maximum amount of information (max-detail) of any relevant physician-identified fact on that page.
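The two measures can be sketched as follows. This is a minimal illustration of the definitions above; the data layout (one 0-4 detail rating per physician-identified fact) is assumed for the example and is not the study's actual data format.

```python
def fact_breadth(ratings):
    """Total number of facts for the topic that occur on the page (rating > 0)."""
    return sum(1 for r in ratings if r > 0)

def fact_depth(ratings):
    """Maximum amount of information (max-detail) of any fact on the page."""
    return max(ratings)

# Hypothetical page covering 4 of the 14 melanoma facts, one in high detail.
page = [0, 3, 1, 0, 4, 0, 0, 2, 0, 0, 0, 0, 0, 0]
print(fact_breadth(page))  # 4
print(fact_depth(page))    # 4
```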

The above analysis was conducted on 728 unique webpages related to five search topics (1). Below we describe the steps used in the cluster analysis for one of the five topics, which involved the detailed analysis of 189 unique webpages about melanoma risk/prevention. The analyses for the other four topics produced results that were similar to those shown below.

The cluster analysis was done in the following two steps:

1. Estimate the number of clusters. We used the Minimum Message Length [5] (MML) criterion (2) to estimate the optimal number of clusters based on fact-depth and fact-breadth. Pages with no facts were dropped from the analysis because, although they contained the query terms (e.g., melanoma risk), they did not contain any of the facts necessary for a comprehensive understanding of the topic; such pages were therefore not relevant to the study of how relevant facts were distributed across pages. This left 105 pages for the topic melanoma risk/prevention. Because MML requires interval-level inputs and our fact-depth scale was ordinal, we converted each value on the fact-depth scale to its corresponding mean number of words. To construct the mapping, we randomly selected 25% of the pages, evenly distributed across levels of max-detail and topic, and averaged the number of words that described the relevant fact with the maximum detail. The resulting mapping was: max-detail = 1 mapped to 23.93 words, max-detail = 2 to 66.07 words, max-detail = 3 to 119.73 words, and max-detail = 4 to 513.57 words. The MML algorithm was run for 1-5 clusters.
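The preprocessing described in this step can be sketched as below: zero-fact pages are dropped, and the ordinal max-detail levels are mapped to the reported mean word counts so that MML receives interval-level inputs. The page tuples are illustrative, not data from the study.

```python
# Mean words per max-detail level, as reported above.
WORDS_PER_LEVEL = {1: 23.93, 2: 66.07, 3: 119.73, 4: 513.57}

# Hypothetical (fact-breadth, max-detail) pairs for a handful of pages.
pages = [(0, 0), (3, 2), (8, 3), (2, 4), (1, 1)]

# Drop pages with no facts, then express fact-depth in mean words.
inputs = [(b, WORDS_PER_LEVEL[d]) for b, d in pages if b > 0]
print(inputs)  # the zero-fact page is excluded; depth is now interval-level
```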

2. Identify the boundaries of the clusters. We used the K-means algorithm (SPSS, version 11.5) to determine the cluster boundaries. The inputs to K-means were the same two variables used for MML (fact-breadth, and fact-depth expressed as mean number of words), with the number of clusters provided by MML.
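The study used the K-means implementation in SPSS 11.5; as a stand-in for readers without SPSS, a minimal plain-Python K-means over the same two variables might look as follows. The data points are illustrative, not the study's pages.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(pts):
    """Centroid of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(len(pts[0])))

def kmeans(points, k, iters=50, seed=0):
    """Basic Lloyd's algorithm: assign points to nearest center, recompute."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centers[j]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to become empty.
        centers = [mean_point(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

# (fact-breadth, fact-depth in mean words) for six hypothetical pages.
points = [(2, 23.93), (3, 23.93), (9, 66.07), (10, 119.73),
          (2, 513.57), (3, 513.57)]
centers, clusters = kmeans(points, k=3)
```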

Results. The lowest MML value was obtained for three clusters, meaning that three clusters best characterized the data. This result was used to determine the cluster boundaries using K-means. Figure 2 in the paper shows the results of the K-means cluster analysis; in that figure the clusters are plotted with the Y-axis showing the original ordinal fact-depth scale. Figure S1 shows the same cluster results, but with the Y-axis mapped to the mean number of words for each level of fact-depth. As discussed above, the latter were the inputs used for MML and K-means. Thus, Figure 2 shows the clusters on the original ordinal scale used by the raters, while Figure S1 shows the same clusters on the interval scale used as input to MML and K-means.

Similar results were obtained for four other topics related to melanoma, demonstrating the generality of the results. A cluster analysis of the five topics collapsed together (with fact-breadth normalized, because the number of facts differed across topics) revealed cluster boundaries similar to those shown in Figure 2: sparse pages had fact-breadth <= 40% of the total possible facts and fact-depth = 1-3; general pages had fact-breadth > 40% and fact-depth = 1-3; and specific pages had fact-breadth = 0-80% and fact-depth = 4.
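These collapsed-topic boundaries amount to a simple decision rule for labeling a page, which can be sketched as follows (fact-breadth as a percentage of the total possible facts; fact-depth on the original 1-4 ordinal scale). The function name is illustrative, not from the study.

```python
def page_profile(breadth_pct, depth):
    """Classify a page using the reported cluster boundaries."""
    if depth == 4:
        return "specific"   # fact-depth = 4, fact-breadth 0-80%
    if breadth_pct <= 40:
        return "sparse"     # few facts, low-to-medium detail
    return "general"        # many facts, low-to-medium detail

print(page_profile(20, 2))  # sparse
print(page_profile(60, 3))  # general
print(page_profile(30, 4))  # specific
```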

Understanding the Nature of Sparse Pages

While the rationale for both specific and general pages was intuitively clear, we did not immediately understand the rationale for sparse pages. Why would authors of high-quality websites create pages that covered few facts in low detail? To answer this question, we analyzed the relationship between page profile and the granularity of the topic (e.g. cancer vs. melanoma) contained within the page. Instead of using the contents of the entire page to determine topic granularity, we used the title of the page because: (1) we wished to avoid circularities with the other measured variables (fact-depth and fact-breadth) which were based on the content of the page, and (2) webpage authors typically design titles that reflect the content of their pages.

Method. The title of each page was typed into a separate document (to avoid bias from the page content). Topic granularity was operationalized in terms of 7 nodes in a hierarchical skin-cancer taxonomy (developed by skin-cancer physicians based on the analysis of real-world skin cancer questions (3)). A rater was asked to categorize each page title in terms of nodes in the hierarchical skin cancer taxonomy, using the following scheme: 1=cancer, 2=skin cancer, 3=melanoma, 4=risk/prevention, 5=descriptive information about risk/prevention, 6=fact about risk/prevention (e.g. high UV exposure), and 7=other categories outside the above categories (e.g. non-cancer skin problems such as acne).

Results. Figure S2 shows the proportion of page titles rated at the 7 different topic granularities in each page profile. As shown, sparse pages had the largest proportion (69%) of titles that were rated as “other” (shown in black). This analysis therefore suggests that sparse pages are not “poor” or “irrelevant” pages. Rather, sparse pages are written for topics that are distantly related to the topic being searched (in this case melanoma risk/prevention), but which also include some information relevant to that topic.

References

  1. S. K. Bhavnani, J. Am. Soc. Info. Sci. and Tech., in press.
  2. M. A. T. Figueiredo, A. K. Jain, IEEE Trans. Pattern Anal. Machine Intel. 24, 381 (2002).
  3. S. K. Bhavnani et al., in Proceedings of the AMIA 2003 Annual Symposium, M. Musen, Ed. (2003), pp. 81-85.


[1] A fact is here defined as a statement about a topic, agreed upon by experts in the field. For example, facts can be claims (e.g. having fair skin increases your risk of getting melanoma) or recommendations (e.g. confirm your self-diagnosis by consulting a local health care provider).

[2] The distinction between sparse and general pages is less clear near the boundary of the two page-types (at 5-6 facts). However, there is a clear difference between an archetypical sparse page with few facts and low detail (e.g., (3, 2)) and an archetypical general page with many facts in medium detail (e.g., (8, 3)).

[3] See (1) for a detailed description of how the top-10 websites for melanoma information were identified and how relevant webpages for each topic were retrieved and analyzed from those websites.