Robert C. Miller and Krishna Bharat
Summary
This paper presents an overview of the SPHINX web-crawling system. SPHINX takes a query with parameters such as time, depth, and type of search, and crawls either a specific website or the entire web. The paper describes the operations that SPHINX can perform as well as the underlying algorithms in the open-source code.
SPHINX's primary contributions are its customizable search queries and an attractive user interface that lets the user inspect many aspects of the crawl itself. SPHINX is applicable to a wide variety of searches, but its primary drawback is memory usage: the program is written in Java and is not memory-efficient, which is a poor combination for lengthy web searches. Otherwise, the code is solid and built on a well-designed foundation.
Keywords
crawlers, robots, spiders, web automation, web searching, Java, end-user programming, mobile code
Methods
SPHINX is driven by a priority queue that selects which links to follow, based on user requirements. The code provides the foundation on which the user enters search criteria; without them, SPHINX defaults to breadth-first search. The user may supply heuristics that rank pages in the priority queue for selection, and SPHINX indicates a hit when a match to the query phrase is found. SPHINX keeps track of links to and from a page by holding the graph visited so far in an in-memory cache, which is rather memory-intensive. SPHINX provides a solid foundation upon which customized crawlers may be built.
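The link-selection idea described above can be illustrated with a minimal sketch. This is not SPHINX's API or code: the function `crawl`, its parameters, and the toy `site` link graph are all hypothetical, and the "web" is an in-memory dictionary rather than real HTTP fetches. It only shows how a priority queue with a user-supplied heuristic falls back to breadth-first order when the priority is simply the link depth.

```python
import heapq
from itertools import count

def crawl(graph, start, max_pages, priority=None, is_hit=None):
    """Visit pages reachable from `start`, lowest priority value first.

    `graph` maps each page to the list of pages it links to (a stand-in
    for fetching and parsing real pages). All names here are illustrative.
    """
    if priority is None:
        # Default: order by depth, which yields breadth-first search.
        priority = lambda url, depth: depth
    if is_hit is None:
        is_hit = lambda url: False

    tie = count()  # insertion counter keeps equal-priority pages in FIFO order
    frontier = [(priority(start, 0), next(tie), start, 0)]
    seen = {start}  # pages already queued, so no page is visited twice
    visited, hits = [], []

    while frontier and len(visited) < max_pages:
        _, _, url, depth = heapq.heappop(frontier)
        visited.append(url)
        if is_hit(url):
            hits.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                entry = (priority(link, depth + 1), next(tie), link, depth + 1)
                heapq.heappush(frontier, entry)
    return visited, hits

# Toy link graph (hypothetical site).
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/docs"],
    "/b": ["/b/docs", "/a/docs"],
    "/a/docs": [],
    "/b/docs": [],
}

# No heuristic supplied: breadth-first order.
bfs_order, _ = crawl(site, "/", max_pages=10)

# User heuristic: prefer URLs containing "b"; report pages ending in "docs".
ranked_order, ranked_hits = crawl(
    site, "/", max_pages=10,
    priority=lambda url, depth: 0 if "b" in url else 1,
    is_hit=lambda url: url.endswith("docs"),
)
```

Note that the `seen` set and the link graph both grow with the number of pages discovered, which gives a rough sense of why keeping the whole visited graph in memory becomes expensive on long crawls.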
Rating
7
Bibtex Entry
@article{miller98,
author = "Robert C. Miller and Krishna Bharat",
title = "SPHINX: a framework for creating personal, site-specific Web crawlers",
journal = "Computer Networks and ISDN Systems",
volume = "30",
number = "1--7",
pages = "119--130",
year = "1998",
url = "http://citeseer.nj.nec.com/context/987919/0"
}