Abstract:
The World Wide Web has become a key source of knowledge
pertaining to almost every walk of life. Unfortunately,
much of the data on the Web is highly ephemeral,
with an estimated 50-80% of content changing
within a short time. Continuing the pioneering efforts of
many national (digital) libraries, organizations such as the
International Internet Preservation Consortium (IIPC), the
Internet Archive (IA) and the European Archive (EA) have
been working tirelessly towards preserving the ever-changing
Web.
However, while these Web archiving efforts have paid significant
attention to the long-term preservation of Web
data, they have paid little attention to developing a global-scale
infrastructure for collecting, archiving, and performing
historical analyses on the collected data. Based on insights
from our recent work on building text analytics for Web
Archives, we propose EverLast, a scalable distributed framework
for next-generation Web archival and temporal text
analytics over the archive. Our system is built on a loosely coupled
distributed architecture that can be deployed over
large-scale peer-to-peer networks. In this way, we allow the
integration of many archival efforts undertaken mainly at the national
level by national digital libraries. Key features of
EverLast include support for time-based text search and analysis
and the use of human-assisted archive gathering. In this
paper, we outline the overall architecture of EverLast, and
present some promising preliminary results.