Sergey Brin and Lawrence Page
Summary
This paper outlines the foundations of the Google search engine, describing the critical components of a search engine: crawling, indexing, and searching, including the ranking function used to order query results. Brin and Page focus on PageRank, a link-based measure of page importance computed from a database of the links between crawled pages. At query time, each matching page receives a numerical score that combines its PageRank with an information-retrieval score derived from its hit list: the occurrences of the query words in the page, weighted by type (for example title, anchor text, or plain text, along with attributes such as font size and capitalization) and by count.
Google is the strongest search engine on the web, primarily because of the effectiveness of PageRank. The paper presents the algorithm in enough detail to reproduce it, and the reported results support the approach. It is a well-written, complete paper that highlights all the critical aspects of a search engine.
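To give a sense of how compactly the link-analysis step can be stated, here is a minimal Python sketch of the PageRank iteration using the formula given in the paper, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A, C(T) is the number of links going out of T, and d is a damping factor (the paper suggests 0.85). The function name, the dictionary-based graph, the fixed iteration count, and the treatment of pages without outgoing links are assumptions made for illustration, not the authors' implementation.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the PageRank formula from the paper on a small link graph.

    links: dict mapping each page to the list of pages it links to
           (hypothetical in-memory representation).
    Returns a dict mapping page -> PageRank score.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Assumption: start every page at rank 1.0.
    rank = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the (1 - d) base term.
        new_rank = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                # Assumption: a page with no outgoing links passes on nothing.
                continue
            share = d * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"A": ["B"], "B": ["A"], "C": ["A"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 3))
```

In this toy graph, page C has no backlinks and so keeps only the base term 1 - d, while A, which is linked from both B and C, ends up with the highest score.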
Keywords
search engine, information retrieval, PageRank, Google, distributed crawler, hit list, forward index
Methods
The Google search engine accomplishes three major tasks: crawling, indexing, and searching. Crawling is performed by distributed crawlers fed by a URL server; the ordering of the crawl based on backlink counts is not detailed in this paper (see cho98). Google stores what it indexes in several large data structures: the repository (compressed full HTML of each page), the document index (docID, document status and statistics, and a pointer into the repository), the lexicon (the list of known words, each mapped to a wordID), hit lists (the occurrences of a word in a document, with position, font, and capitalization), the forward index (partially sorted into barrels by wordID range; records each docID with its wordIDs and hit lists), and the inverted index (maps each wordID to the docIDs that contain it). PageRank is computed offline from the backlinks to each page (see also cho98). At query time, pages are ranked by combining this PageRank with a score based on the type and count of word hits in the hit lists.
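The relationship between the forward index and the inverted index can be sketched in a few lines of Python. This is a simplified illustration, not the paper's on-disk barrel format: the nested-dictionary layout, the function name invert, and the reduction of each hit to a bare word position are assumptions (the paper's hits also encode font and capitalization).

```python
from collections import defaultdict


def invert(forward_index):
    """Build an inverted index from a forward index.

    forward_index: dict mapping docID -> {wordID: [hit positions]}
                   (hypothetical in-memory layout).
    Returns a dict mapping wordID -> list of (docID, hits), ordered by docID.
    """
    inverted = defaultdict(list)
    # Visit documents in docID order so each posting list comes out sorted.
    for doc_id in sorted(forward_index):
        for word_id, hits in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)


if __name__ == "__main__":
    forward = {
        1: {10: [0, 4], 11: [2]},  # doc 1 contains word 10 twice, word 11 once
        2: {10: [7]},              # doc 2 contains word 10 once
    }
    print(invert(forward))
```

A single-word query then amounts to looking up one wordID in the inverted index and scoring each (docID, hits) entry it returns.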
Rating
9
Bibtex Entry
@article{brin98,
author = "Sergey Brin and Lawrence Page",
title = "The anatomy of a large-scale hypertextual {Web} search engine",
journal = "Computer Networks and ISDN Systems",
volume = "30", number = "1--7",
pages = "107--117",
year = "1998",
url = "citeseer.nj.nec.com/brin98anatomy.html"
}