Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson and Jon Kleinberg
Summary
The primary contribution of this paper is the 'automatic resource compiler' (ARC) algorithm for ranking web pages. The notions of 'hub' and 'authority' pages are introduced: a hub is a page with many relevant links, and an authority is a page that contains directly relevant information. First, pages are indexed off-line by a spider. When a user enters a query, the algorithm works through the links, increasing a page's overall score when it fits either the hub or the authority category. Experimental results do not show a statistically significant improvement over Yahoo, but ARC performed comparably.
The experimental results, comparable to Yahoo's, indicate good algorithmic performance, but the primary value of the paper lies in identifying hub and authority pages. This classification is useful across many IR subtopics, and the algorithm itself may be built upon in future work.
Keywords
taxonomies, link analysis, anchor text, information retrieval
Methods
First, a 'root set' of 200 documents is retrieved from the AltaVista search engine as a starting point. The neighbors of these documents (pages reached by forward links and backlinks) are found and added to the set. A hub score and an authority score are maintained for each page, based on the number of links from the page and to the page, respectively. The algorithm then iterates over the pages, normalizing the scores and weighting them according to the actual content of the pages. Two vectors, one of hub scores and one of authority scores, are generated for the query. Pages are paired with their links to form a matrix that has an entry at (p,q) if p links to q. The transpose of this matrix is computed, and the matrix and its transpose are multiplied by the authority and hub vectors, respectively, for k iterations. The resulting vectors converge (typically around k = 5) and determine the final ranking.
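The iteration described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the dictionary representation of the link matrix, and the per-link weights are assumptions made for illustration. The content-based weighting is taken as given, with each link's weight supplied by the caller.

```python
from math import sqrt


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length (hypothetical helper)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hub_authority_scores(links, iterations=5):
    """Iterate hub and authority scores over a weighted link graph.

    links maps (p, q) to a weight and has an entry only if p links to q.
    The weights are an assumption here; use 1.0 for an unweighted graph.
    """
    pages = sorted({page for edge in links for page in edge})
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority update: each page collects the hub scores of the pages
        # that link to it (multiplication by the transposed link matrix).
        new_auth = {page: 0.0 for page in pages}
        for (p, q), weight in links.items():
            new_auth[q] += weight * hub[p]

        # Hub update: each page collects the authority scores of the pages
        # it links to (multiplication by the link matrix itself).
        new_hub = {page: 0.0 for page in pages}
        for (p, q), weight in links.items():
            new_hub[p] += weight * new_auth[q]

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


if __name__ == "__main__":
    # Toy graph: h1 links to a1 and a2, h2 links only to a1.
    toy_links = {("h1", "a1"): 1.0, ("h1", "a2"): 1.0, ("h2", "a1"): 1.0}
    hubs, authorities = hub_authority_scores(toy_links, iterations=5)
    for page in sorted(authorities, key=authorities.get, reverse=True):
        print(page, round(authorities[page], 3))
```

In the toy graph, a1 ends up with the highest authority score because both hub pages link to it, and h1 ends up with the highest hub score because it links to both authorities.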
Rating
7
Bibtex Entry
@article{chakrabarti98,
author = "Soumen Chakrabarti and Byron Dom and Prabhakar Raghavan and Sridhar Rajagopalan and David Gibson and Jon Kleinberg",
title = "Automatic resource compilation by analyzing hyperlink structure and associated text",
journal = "Computer Networks and ISDN Systems",
volume = "30",
number = "1--7",
pages = "65--74",
year = "1998",
url = "citeseer.nj.nec.com/chakrabarti98automatic.html"
}