Evaluating topic-driven web crawlers

Filippo Menczer

Summary

This paper identifies criteria for evaluating topic-driven web crawlers. Crawlers are normally evaluated on recall and precision, but the author points out that these criteria are ill-suited to the web as a whole because of its infinite supply of pages: the ratios cannot be computed accurately at the scale a crawler is designed for. The ideal measurement would be a live user indicating which pages are relevant, but this is not feasible. During the crawl, agents can identify how important a page is and produce a summary analysis of the crawl's success so far. A page's importance breaks down into two measures: link-based and similarity-based importance. The summary analysis is the ratio of relevant pages retrieved to all pages retrieved. Three assessment measures were used to capture page importance and the summary analysis. Assessment with classifiers uses the Yahoo topics to classify pages into categories, so that each newly retrieved page can be classified immediately. Assessment with a retrieval system ranks the crawled pages against the topics using the SMART system. The final assessment uses mean topic similarity, a measure of proximity to the topic in context vector space.
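As an illustration of the summary ratio and the mean topic similarity described above, here is a minimal sketch. It assumes raw term-frequency vectors and cosine similarity, which may differ from the weighting the paper actually uses; the function names are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (assumption: lowercased whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_topic_similarity(topic, page_texts):
    """Average similarity between the topic and every page retrieved so far."""
    if not page_texts:
        return 0.0
    topic_vec = term_vector(topic)
    total = sum(cosine_similarity(topic_vec, term_vector(p)) for p in page_texts)
    return total / len(page_texts)


def relevant_fraction(crawled_urls, relevant_urls):
    """Summary ratio: relevant pages retrieved over all pages retrieved."""
    if not crawled_urls:
        return 0.0
    relevant = set(relevant_urls)
    return sum(1 for url in crawled_urls if url in relevant) / len(crawled_urls)
```

Both functions can be re-evaluated after each fetched page, which gives a running picture of crawl quality without needing the full set of relevant pages on the web.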

Experimental results consistently show that the BestFirst crawler outperforms InfoSpiders, which in turn outperforms PageRank. These results are a little surprising, since InfoSpiders was expected to generate a population of agents that homed in on a particular topic, but they lend credibility to the simple BestFirst agent, which uses only the topic itself to guide its search. The consistency among the three assessments indicates that each is a robust measurement, and the experiment provides a sound platform against which future crawlers may be evaluated.
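To make the idea of a topic-only best-first strategy concrete, here is a rough sketch. It is not the paper's implementation: it assumes a toy in-memory `web` dictionary in place of real fetching, a hypothetical word-overlap scorer `topic_score`, and that each link inherits the score of the page on which it was first found.

```python
import heapq


def topic_score(topic, text):
    """Fraction of topic words present in the text (a stand-in scorer, assumed)."""
    topic_words = set(topic.lower().split())
    if not topic_words:
        return 0.0
    return len(topic_words & set(text.lower().split())) / len(topic_words)


def best_first_crawl(web, seeds, topic, max_pages):
    """Visit up to max_pages pages, always expanding the best-scored link next.

    web: dict mapping url -> (page_text, [outlink urls]); a toy stand-in
    for fetching and parsing real pages.
    """
    # heapq is a min-heap, so scores are negated; seeds get top priority.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue  # unreachable page in the toy web
        text, links = web[url]
        visited.append(url)
        score = topic_score(topic, text)
        for link in links:
            if link not in seen:  # first discovery fixes the link's priority
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited
```

The only signal guiding the search is the topic text itself, which is what makes the strategy simple compared with an adaptive population of agents.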

Keywords

information retrieval, link analysis, evaluation, experimental design and metrics, InfoSpiders, PageRank, BestFirst, SMART

Methods

The crawlers were tested using their basic algorithms under constraints such as a fixed max_pages limit. The classifier assessment used the Widrow-Hoff, Exponential Gradient, and Rocchio classifiers, which yielded an "average fraction of true positives" measure for each crawler. The SMART assessment measured the average number of pages found over the number of highest ranked pages, and the mean similarity assessment showed the mean distance from the topic (in context vector space) over pages retrieved. The consistency among these measurements indicates their utility for this problem.
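One possible reading of the retrieval-system measure above is the fraction of the top-ranked pages that a crawler managed to find, averaged over topics. The sketch below follows that reading only; the function names and the cutoff `k` are assumptions, and the ranking is taken as a given list rather than produced by SMART.

```python
def found_in_top_k(ranked_urls, crawled_urls, k):
    """Fraction of the k highest-ranked pages that the crawler retrieved."""
    top_k = ranked_urls[:k]
    if not top_k:
        return 0.0
    crawled = set(crawled_urls)
    return sum(1 for url in top_k if url in crawled) / len(top_k)


def average_over_topics(per_topic, k):
    """Mean of found_in_top_k across topics.

    per_topic: list of (ranked_urls, crawled_urls) pairs, one pair per topic.
    """
    if not per_topic:
        return 0.0
    return sum(found_in_top_k(r, c, k) for r, c in per_topic) / len(per_topic)
```

Computing the same figure for each crawler on the same ranked lists allows a direct comparison between them.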

Rating

7

Bibtex Entry

@inproceedings{menczer01,
  author = "Filippo Menczer",
  title = "Evaluating topic-driven web crawlers",
  booktitle = "ACM SIGIR 2001, to appear",
  url = "http://dollar.biz.uiowa.edu/~fil/Papers/sigir-01.pdf"
}
