How Did the Developers of Google™ Rely on Discrete Mathematics?

Gerald Kruse

JuniataCollege

Huntingdon, PA USA

The recent media attention during its Initial Public Offering served to underscore Google’s position as the most ubiquitous search engine. The main reason for Google’s popularity is its uncanny ability to provide useful search results. Although there have been many attempts to manipulate these search results, including the well known “Google Bombs”[1],it is safe to assume that most people are not aware that Google’s PageRank is determined using linear algebra and graph theory.

When considering how to rank the importance, or PageRank, of web pages, the developers of Google decided to exploit the link structure of the World Wide Web. Rather than just ranking a page solely on its number of backlinks (number of pages pointing to it), the developers wanted to consider the popularity of the backlinks as well. For example, as Page, Brin, et. al. explain in [2], if a page has just one backlink, but it is from the Yahoo home page, then it should have a higher page rank than another page which has several backlinks which are from obscure sites. This leads to their intuitive definition:“A page has high rank if the sum of the ranks of its backlinks is high.”

Let’s convert the above to a mathematical definition. We begin by letting be a web page, be the simple PageRank of , be the set of pages points to, and be the set of pages pointing to . This leads to, or the number of links from, and the intuitive definition of PageRank becomes .

To demonstrate, we will calculate PageRank in an example. We start with the following directed graphcontaining three web pages, where a directed edge from one page (vertex) to another represents a link.

.

This system of equations can be written in matrix notation as: . In this formulation it is obvious that the PageRank vector is simply an eigenvector of the coefficient matrix. Solving and normalizing gives PageRanks of 0.4, 0.2, and 0.4 for pages A, B, and C, respectively. C has backlinks from both A and B, so it is a popular page. A has only one backlink, but it is from a popular page, so it has a high rank too. However, B has one backlink from a popular page, why doesn’t it have a high rank? Well, , so A’s popularity is spread over two links.

With the billions of pages on the web to rank, the eigenvector calculation could be problematic. Thankfully, from the definition of , all entries in the coefficient matrix will be between 0 and 1inclusive, so it is stochastic[3]. In addition, the developers added a random-surfer term to their PageRank formula [2], so it becomes . This term guards against the rank propagation getting caught in a “rank-sink” endless loop, and it insuresthat the coefficient matrix is regular, which allows for a computationally easier iterative eigenvector calculation.

References:

[1]

[2]Page, Brin, Motwani, Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project.

[3]Poole, David. Linear Algebra, A Modern Introduction. Brooks-Cole, 2003.

Acknowledgements:

The author would like to thank Henry Walker for distributing [2] at the 2003 MAA-PREP Workshop on Discrete Mathematics.