Keywords:
-
Abstract:
Information retrieval (IR) in peer-to-peer (P2P) networks,
where the corpus is spread across many loosely coupled peers,
has recently gained importance. In contrast to IR systems
on a centralized server or server farm, P2P IR faces the
additional challenge of either being oblivious to global
corpus statistics or having to compute the global measures
from local statistics at the individual peers in an efficient,
distributed manner. One measure of particular interest is the
global document frequency of each term, which serves as a
term-specific weight in the scoring and ranking of merged
search results obtained from different peers.
This paper presents an efficient solution to the problem of
estimating global document frequencies in a large-scale, highly
dynamic P2P network in which peers can join and leave
at short notice. In particular, the developed method
takes into account the fact that the local document collections
of autonomous peers may arbitrarily overlap, so that global counting
needs to be duplicate-insensitive. The method is based on hash sketches
as a technique for compact data synopses. Experimental studies
demonstrate the estimator's accuracy, scalability,
and ability to cope with high dynamics. Moreover, the benefit for
ranking P2P search results is shown by experiments with real-world Web data and
queries.
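
To illustrate why hash sketches allow duplicate-insensitive counting, the following is a minimal sketch, assuming Flajolet-Martin style bitmaps with stochastic averaging. The class name HashSketch, the use of SHA-1 as the hash function, the choice of 64 bitmaps, and the document identifiers are all illustrative assumptions and are not taken from the paper. The key property shown is that two peers' sketches for a term can be merged with a bitwise OR, so a document held by both peers is counted only once.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


class HashSketch:
    """Hypothetical helper: distinct-count synopsis built from m bitmaps."""

    def __init__(self, num_bitmaps=64):
        self.m = num_bitmaps
        self.bitmaps = [0] * num_bitmaps

    @staticmethod
    def _hash(item):
        # Assumption: SHA-1 truncated to 64 bits stands in for a uniform hash.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash(item)
        bucket = h % self.m          # which bitmap receives this item
        rest = h // self.m           # remaining bits decide the bit position
        # Index of the least-significant 1 bit (geometrically distributed).
        rho = (rest & -rest).bit_length() - 1 if rest else 63
        self.bitmaps[bucket] |= 1 << rho

    def union(self, other):
        # Bitwise OR merges two sketches; adding an item twice changes nothing.
        if self.m != other.m:
            raise ValueError("sketches must use the same number of bitmaps")
        merged = HashSketch(self.m)
        merged.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return merged

    def estimate(self):
        # For each bitmap, find the index of its lowest zero bit, then average.
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)


# Two peers whose local collections overlap in 5,000 documents for one term.
peer_a, peer_b = HashSketch(), HashSketch()
for i in range(0, 10000):
    peer_a.add(f"doc-{i}")
for i in range(5000, 15000):
    peer_b.add(f"doc-{i}")

merged = peer_a.union(peer_b)
print("estimated global document frequency:", round(merged.estimate()))
```

The merged sketch is identical to the one that would result from inserting all 15,000 distinct documents into a single sketch, so the overlap does not inflate the estimate. The estimate itself is approximate; using more bitmaps reduces its variance at the cost of a larger synopsis.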