J. Rennie and A. McCallum
Summary
This paper presents an algorithm with reinforcement learning that uses context to choose which links a spider should follow. The context is the neighborhood text around a given query, and links closer to a query phrase are followed first. Links that include the query or are very close in the neighborhood text are defined to provide 'immediate reward', or are the most likely to lead to a hit page. Experimental evidence against a regular breadth-first spider show significant improvements in number of hit pages retrieved.
The primary contribution of this paper is the concept of an immediate reward. Following links that lead to more relevant pages was shown by an earlier google paper to be more useful in future queries, so this approach is immediately justified. In addition, my own research with spiders that view the neighborhood text around a link have provided additional information about the utility of this approach.
Keywords
reinforcement learning, spiders, amoritzed cost, naive bayes, immediate reward, neighborhood text, future rewards,
Methods
The spidering algorithm selects a link based reinforcement learning and the context around the link. The spider itself learns over time by maintaining a current state value (sum of rewards minus discounts over the path followed so far) and using a learning function to select a link. The learning functionis the sum of the reward of the current state (i.e. if the current page is relevant) plus the potential value of the link. The potential value is a combination of the context surrounding the link and the transition cost.
Assumptions
Rating
8
Bibtex Entry
@proceedings{ rennie99,
author = "J. Rennie and A. McCallum",
title = "Efficient Web Spidering with Reinforcement Learning",
text = "In Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), 1999.",
year = "1999",
url = "http://citeseer.nj.nec.com/context/1239690/0"
}