The Term Vector Database: fast access to indexing terms for web pages

Raymie Stata, Krishna Bharat and Farzin Maghoul


Summary

This paper describes a database of term vector information for web pages, built to support web search applications. The database maps web page IDs to terms and their weights, which makes looking up the indexing terms of a known page very efficient. The database is used in conjunction with a 'Connectivity Server', which provides fast access to the links into and out of a particular page. The underlying theory is that the most relevant pages for a given query topic are highly connected within a subgraph of the web. This provides the foundation for a ranking algorithm for a given query.

The success of link-based search engines such as Google highlights the value of this approach. This paper builds on earlier work by Krishna Bharat, including the Connectivity Server and connectivity-based topic distillation; the PageRank algorithm used by Google is due to Brin and Page. The approach is counter-intuitive, since a search index generally maps terms to page IDs rather than page IDs to terms. The speed of this reverse mapping is what makes applications that need the terms of many pages at once practical.
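The difference between the two mapping directions can be illustrated with a minimal Python sketch. The toy data and the names `forward_index`, `build_inverted_index` and `term_vector` are hypothetical and are not taken from the paper; the sketch only shows that a page-to-terms store answers "what terms describe this page?" with one lookup, while the usual terms-to-pages index answers "which pages contain this term?".

```python
from collections import defaultdict

# Hypothetical toy data: page ID -> list of (term, weight) pairs.
# This is the "forward" direction stored by a term vector database.
forward_index = {
    1: [("jaguar", 0.9), ("car", 0.6)],
    2: [("jaguar", 0.8), ("cat", 0.7)],
    3: [("car", 0.5), ("engine", 0.4)],
}


def build_inverted_index(forward):
    """Derive the usual term -> page IDs mapping from a page -> terms mapping."""
    inverted = defaultdict(set)
    for page_id, vector in forward.items():
        for term, _weight in vector:
            inverted[term].add(page_id)
    return dict(inverted)


def term_vector(page_id):
    """Return the term vector of a page as a dict, or an empty dict if unknown."""
    return dict(forward_index.get(page_id, []))


inverted_index = build_inverted_index(forward_index)

print(sorted(inverted_index["jaguar"]))  # pages containing the term "jaguar"
print(term_vector(3))                    # terms describing page 3
```

With only the inverted index, recovering the terms of page 3 would mean scanning every term's posting list, which is why a separate forward store is attractive when an application needs the vectors of many pages.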

Keywords

page classification, term vectors, topic distillation, web connectivity, web search

Methods

The Term Vector Database maps page IDs to term vector information. A term vector is a sequence of term-weight pairs; a term is selected for inclusion in the database if it scores above a threshold on the term frequency × inverse document frequency (TF-IDF) measure. Topic vectors are also built from the terms of pages related to a query topic, so that the relevance of a candidate page can be scored by comparing its term vector against the topic vector. This helps with ranking, since pages on closely related topics tend to be clustered together in the link graph, and it is used in what the authors call topic distillation. Category vectors are also stored, which correspond to high-level directory categories such as those at directory.google.com. As pages are crawled, their vectors are added to this database (among others, see brin98). When a user enters a query, an inverted index maps the query keywords to candidate pages, and the term vector database then maps those pages back to their term vectors for ranking information.
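The selection and comparison steps described above can be sketched in Python. This is an illustration under stated assumptions, not the paper's implementation: the weighting formula (relative term frequency times log of N over document frequency), the threshold value, the cosine similarity measure, and all names and toy data are assumptions chosen for the example.

```python
import math
from collections import Counter


def build_term_vectors(pages, threshold):
    """Map page_id -> {term: weight}, keeping only terms whose TF-IDF
    weight exceeds the threshold. The formula used here is an assumption."""
    n_pages = len(pages)
    doc_freq = Counter()
    for tokens in pages.values():
        doc_freq.update(set(tokens))  # count each term once per page

    vectors = {}
    for page_id, tokens in pages.items():
        counts = Counter(tokens)
        vector = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_pages / doc_freq[term])
            weight = tf * idf
            if weight > threshold:
                vector[term] = weight
        vectors[page_id] = vector
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: page ID -> tokens.
pages = {
    "p1": ["jaguar", "car", "engine", "car"],
    "p2": ["jaguar", "cat", "jungle"],
    "p3": ["car", "engine", "repair"],
}
vectors = build_term_vectors(pages, threshold=0.15)

# Hypothetical topic vector for a query about cars.
topic = {"car": 1.0, "engine": 0.5}
ranked = sorted(vectors, key=lambda p: cosine(topic, vectors[p]), reverse=True)
print(ranked[0])  # page whose stored vector is closest to the topic
```

Terms that appear on most pages get a low inverse document frequency and fall below the threshold, so each stored vector stays short. The same `cosine` function could compare a page against a category vector instead of a topic vector.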

Rating

8

Bibtex Entry

@inproceedings{stata00,
  author = "Raymie Stata and Krishna Bharat and Farzin Maghoul",
  title = "The Term Vector Database: fast access to indexing terms for web pages",
  booktitle = "WWW9 Conference",
  year = "2000",
  url = "http://www9.org/w9cdrom/159/159.html"
}

 
