A very in-depth look at how Googles Search Algorithm works..

This Article explains in detail how the algorythm goes about making sense of the VAST and unorganized World Wide Web – http://www.ams.org/samplings/feature-column/fcarc-pagerank

Here is an exerpt:

How to tell who’s important

If you’ve ever created a web page, you’ve probably included links to other pages that contain valuable, reliable information. By doing so, you are affirming the importance of the pages you link to. Google’s PageRank algorithm stages a monthly popularity contest among all pages on the web to decide which pages are most important. The fundamental idea put forth by PageRank’s creators, Sergey Brin and Lawrence Page, is this: the importance of a page is judged by the number of pages linking to it as well as their importance.

We will assign to each web page P a measure of its importance I(P), called the page’s PageRank. At various sites, you may find anapproximation of a page’s PageRank. (For instance, the home page of The American Mathematical Society currently has a PageRank of 8 on a scale of 10. Can you find any pages with a PageRank of 10?) This reported value is only an approximation since Google declines to publish actual PageRanks in an effort to frustrate those would manipulate the rankings.

Here’s how the PageRank is determined. Suppose that page Pj has lj links. If one of those links is to page Pi, then Pj will pass on 1/ljof its importance to Pi. The importance ranking of Pi is then the sum of all the contributions made by pages linking to it. That is, if we denote the set of pages linking to Pi by Bi, then

[  I(P_i)=sum_{P_jin B_i} frac{I(P_j)}{l_j}  ]

This may remind you of the chicken and the egg: to determine the importance of a page, we first need to know the importance of all the pages linking to it. However, we may recast the problem into one that is more mathematically familiar.

Let’s first create a matrix, called the hyperlink matrix, $ {bf H}=[H_{ij}] $ in which the entry in the ith row and jth column is

[  H_{ij}=left{begin{array}{ll}1/l_{j} &  	hbox{if } P_jin B_i \ 	0 & hbox{otherwise} 	end{array}right.  ]

Notice that H has some special properties. First, its entries are all nonnegative. Also, the sum of the entries in a column is one unless the page corresponding to that column has no links. Matrices in which all the entries are nonnegative and the sum of the entries in every column is one are called stochastic; they will play an important role in our story.

We will also form a vector $ I=[I(P_i)] $ whose components are PageRanks–that is, the importance rankings–of all the pages. The condition above defining the PageRank may be expressed as

[  I = {bf H}I  ]

In other words, the vector I is an eigenvector of the matrix H with eigenvalue 1. We also call this a stationary vector of H.

Let’s look at an example. Shown below is a representation of a small collection (eight) of web pages with links represented by arrows.

The corresponding matrix is

with stationary vector

This shows that page 8 wins the popularity contest. Here is the same figure with the web pages shaded in such a way that the pages with higher PageRanks are lighter.

Want the whole story? Go to – http://www.ams.org/samplings/feature-column/fcarc-pagerank

Return to Main Menu