Efficient crawling through URL ordering

Junghoo Cho, Hector Garcia-Molina and Lawrence Page

Summary

This paper presents an algorithm for ordering the pages a crawler should visit according to their potential importance. Potentially important pages are those with many links pointing to them (a high backlink count), those that contain words related to a given topic, and those selected by their location relative to a given server. The intuition is that, given a finite amount of crawl time, visiting pages with a higher backlink count first yields more relevant pages.

This crawler serves as the foundation for Google's crawler. The novel concept is the backlink count, which, when used in conjunction with the PageRank algorithm, helps return the most relevant results for a query. The experimental results show that the backlink-count crawler retrieves more relevant pages than the other crawlers tested.
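
The ordering idea in the summary can be illustrated with a small sketch. This is not the paper's implementation: it assumes a toy in-memory link graph instead of real fetching, and the names `crawl_order`, `graph`, `seed`, and `limit` are hypothetical. At each step it visits the frontier URL with the most backlinks seen so far from already-crawled pages.

```python
def crawl_order(graph, seed, limit):
    """Visit up to `limit` pages, always picking the frontier URL that has
    the most backlinks from pages crawled so far.

    `graph` maps each URL to the list of URLs it links to (a stand-in for
    fetching and parsing a page; this is an assumption of the sketch).
    """
    backlinks = {}        # url -> links seen from already-crawled pages
    frontier = {seed}     # discovered but not yet visited
    seen = {seed}         # everything discovered so far
    visited = []
    while frontier and len(visited) < limit:
        # Highest backlink count first; ties are broken alphabetically.
        url = min(frontier, key=lambda u: (-backlinks.get(u, 0), u))
        frontier.remove(url)
        visited.append(url)
        # Count each outgoing link once per page.
        for target in set(graph.get(url, ())):
            backlinks[target] = backlinks.get(target, 0) + 1
            if target not in seen:
                seen.add(target)
                frontier.add(target)
    return visited


# Toy graph: "e" is linked from both "b" and "c", "d" only from "b".
graph = {"a": ["b", "c"], "b": ["d", "e"], "c": ["e"], "d": [], "e": []}
print(crawl_order(graph, "a", 10))  # ['a', 'b', 'c', 'e', 'd']
```

In the toy graph, "e" is visited before "d" because two crawled pages link to it by the time both are in the frontier. Under a budget of four pages, "d" would be skipped entirely.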

Keywords

crawler, backlink, PageRank, limited buffer, WebBase, link potential

Methods

Pages are ranked according to three measures: similarity to a driving query, backlink count, and PageRank. The similarity measure is the dot product of a binary query-word vector (each entry is 1 if the word appears in the document and 0 otherwise) with the inverse document frequency of each word. The backlink count is the number of links to the page from pages already in the index, which is stored in a Repository database. The PageRank metric is a weighted sum of the backlinks to the page, combined with characteristics of the query words as they appear in the page (one bit each for whether the word is in the title, in bold, in capitals, and so on).
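
The similarity measure described above can be sketched as follows. This is a minimal illustration, not the paper's code: the helper names `idf_weights` and `similarity` are hypothetical, and the inverse document frequency is assumed to be log(N / df), one common definition that the summary does not specify.

```python
import math


def idf_weights(documents):
    """Compute an assumed IDF, log(N / df), for every word in a collection.

    `documents` is a list of word lists.
    """
    n = len(documents)
    df = {}
    for words in documents:
        for w in set(words):          # count each word once per document
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / count) for w, count in df.items()}


def similarity(query_words, doc_words, idf):
    """Dot product of the binary presence vector with the IDF weights:
    the sum of IDF values of the query words that occur in the document."""
    present = set(doc_words)
    return sum(idf.get(w, 0.0) for w in set(query_words) if w in present)


docs = [["web", "crawler"], ["web", "search"], ["crawler", "queue", "web"]]
idf = idf_weights(docs)
score = similarity(["crawler", "search"], docs[0], idf)
```

Here only "crawler" from the query occurs in the first document, so `score` equals its IDF weight. A word such as "web" that appears in every document gets weight zero under this definition and contributes nothing.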

Rating

6

Bibtex Entry

@article{cho98,
  author = "Junghoo Cho and Hector Garc{\'\i}a-Molina and Lawrence Page",
  title = "Efficient crawling through {URL} ordering",
  journal = "Computer Networks and ISDN Systems",
  volume = "30",
  number = "1--7",
  pages = "161--172",
  year = "1998",
  url = "citeseer.nj.nec.com/cho98efficient.html"
}
