Abstract:
We consider a collaboration of peers autonomously crawling the Web. A pivotal issue when designing a peer-to-peer (P2P) Web search engine in this environment is \textit{query routing}: selecting, from a potentially very large number of relevant peers, a small subset to contact in order to satisfy a keyword query. Existing approaches to query routing work well on disjoint data sets. In practice, however, the peers' data collections often overlap heavily, because popular documents are crawled by many peers. Techniques for estimating the cardinality of the overlap between sets that are designed for, and incorporated into, information retrieval engines are largely lacking. In this paper we present a comprehensive evaluation of appropriate overlap estimators and show how they can be incorporated into an efficient, iterative approach to query routing, coined \textit{Integrated Quality Novelty (IQN)}. We further propose to enhance our approach using histograms, combining overlap estimation with the available score/ranking information. Finally, we conduct a performance evaluation in MINERVA, our prototype P2P Web search engine.
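
The idea of overlap-aware, iterative query routing can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper or used in MINERVA. It assumes one particular overlap estimator (min-wise hashing), a hypothetical per-peer quality score, and a simple quality-times-novelty ranking rule; all function names and parameters (`signature`, `estimate_novelty`, `route_query`, `NUM_HASHES`) are chosen for illustration only.

```python
import hashlib

NUM_HASHES = 64  # assumed synopsis length; larger values reduce estimation error


def _hash(doc_id: str, seed: int) -> int:
    """Deterministic 64-bit hash of a document id, salted by the seed."""
    digest = hashlib.blake2b(
        doc_id.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def signature(doc_ids):
    """Min-wise synopsis: the minimum hash value under each salted hash."""
    return [min(_hash(d, s) for d in doc_ids) for s in range(NUM_HASHES)]


def resemblance(sig_a, sig_b):
    """Fraction of matching positions estimates |A n B| / |A u B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def estimate_novelty(cov_sig, cov_size, peer_sig, peer_size):
    """Estimated number of the peer's documents not yet covered."""
    r = resemblance(cov_sig, peer_sig)
    # From r = |A n B| / |A u B| it follows that |A n B| = r(|A| + |B|) / (1 + r).
    overlap = r * (cov_size + peer_size) / (1 + r)
    return max(0.0, peer_size - overlap)


def route_query(peers, k):
    """Greedily pick up to k peers by quality * estimated novelty.

    peers maps a peer name to (quality, signature, collection_size).
    The multiplicative combination is an assumption made for this sketch.
    """
    remaining = dict(peers)
    selected, cov_sig, cov_size = [], None, 0.0

    def novelty(name):
        _, sig, size = remaining[name]
        if cov_sig is None:
            return float(size)
        return estimate_novelty(cov_sig, cov_size, sig, size)

    while remaining and len(selected) < k:
        best = max(remaining, key=lambda n: remaining[n][0] * novelty(n))
        gained = novelty(best)
        _, sig, _ = remaining.pop(best)
        # The synopsis of a union is the element-wise minimum of the synopses.
        cov_sig = sig if cov_sig is None else [min(a, b) for a, b in zip(cov_sig, sig)]
        cov_size += gained
        selected.append(best)
    return selected


docs_a = [f"doc{i}" for i in range(1000)]
docs_c = [f"doc{i}" for i in range(1000, 1500)]
peers = {
    "A": (1.0, signature(docs_a), len(docs_a)),
    "B": (0.9, signature(docs_a), len(docs_a)),  # exact replica of A
    "C": (0.5, signature(docs_c), len(docs_c)),  # disjoint from A
}
chosen = route_query(peers, 2)
```

In this toy setting peer B has a higher quality score than peer C but holds exactly the same documents as peer A, so after A is selected the estimated novelty of B drops to zero and the routing step prefers C. A ranking by quality alone would have selected A and B and retrieved the same documents twice.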