E. Selberg and O. Etzioni
Summary
This paper presents the architecture and capabilities of the MetaCrawler search service. MetaCrawler is distinctive in that it queries many different search engines and combines their responses into a single list of the highest-ranked pages overall. The difficulties include generating the appropriate query for each search engine and then parsing the results, whose format differs from engine to engine. The ranking algorithm is simply a sum over a page's occurrences across the search engines, with each occurrence weighted by the rank that engine gave the page.
This paper is slightly old, but the concept of using multiple search engines is sound, and the idea is used in a variety of search engines today. The paper does not directly provide experimental evidence; efficiency and precision are implied rather than demonstrated. It would be interesting to compare the MetaCrawler results against those of the Google search engine.
Keywords
MetaCrawler, parallel web search, information retrieval, confidence score, Harness, scalability
Methods
A confidence score is used to determine how closely a page matches a query. This measure is based on the rank given to the page by a search engine, combined with how often the page appears across search engines. The main architectural unit is the Harness, which generates the search-engine-specific query and parses the results. The translation from search engine results to the Harness data structure is straightforward HTML parsing, with optimizations for each of the search engines.
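The aggregation step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `aggregate`, the example engine names and URLs, and in particular the weight 1 / rank are assumptions chosen for illustration. The review only states that the score is a sum over page occurrences weighted by each engine's rank.

```python
from collections import defaultdict


def aggregate(results_by_engine):
    """Merge per-engine ranked URL lists into one list sorted by summed score.

    results_by_engine maps a (hypothetical) engine name to its URLs in rank
    order, best first. The weight 1 / rank is an assumed stand-in for the
    confidence score, not the formula used by MetaCrawler.
    """
    scores = defaultdict(float)
    for urls in results_by_engine.values():
        seen = set()
        for rank, url in enumerate(urls, start=1):
            if url in seen:  # count each page at most once per engine
                continue
            seen.add(url)
            scores[url] += 1.0 / rank
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from two engines.
results = {
    "engine_a": ["http://x.example/1", "http://x.example/2"],
    "engine_b": ["http://x.example/2", "http://x.example/3"],
}
ranked = aggregate(results)
for url, score in ranked:
    print(f"{score:.2f}  {url}")
```

In this sketch a page returned by several engines accumulates weight from each of them, so it can outrank a page that only one engine placed first, which matches the combination of rank and cross-engine frequency described above.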
Rating
6
Bibtex Entry
@article{selberg97,
author = "E. Selberg and O. Etzioni",
title = "The MetaCrawler Architecture for Resource Aggregation on the Web",
journal = "IEEE Expert",
number = "January--February",
pages = "11--14",
year = "1997",
url = "citeseer.nj.nec.com/selberg97metacrawler.html"
}