Max-Planck-Institut für Informatik


Scalable phrase mining for ad-hoc text analytics

Bedathur, Srikanta and Berberich, Klaus and Dittrich, Jens and Mamoulis, Nikos and Weikum, Gerhard

MPI-I-2009-5-006. April 2009, 41 pages. | Status: available - back from printing

Abstract in LaTeX format:
Large text corpora with news, customer mail and reports, or Web 2.0
contributions offer a great potential for enhancing business-intelligence
applications. We propose a framework for performing text analytics on such
data in a versatile, efficient, and scalable manner. While much of the prior
literature has emphasized mining keywords or tags in blogs or social-tagging
communities, we emphasize the analysis of interesting phrases. These include
named entities, important quotations, market slogans, and other multi-word
phrases that are prominent in a dynamically derived ad-hoc subset of the
corpus, e.g., being frequent in the subset but relatively infrequent in the
overall corpus. The ad-hoc subset may be derived by means of a keyword query
against the corpus, or by focusing on a particular time period. We investigate
alternative definitions of phrase interestingness, based on the probability of
phrase occurrences. We develop preprocessing and indexing methods for phrases,
paired with new search techniques for the top-k most interesting phrases on
ad-hoc subsets of the corpus. Our framework is evaluated using a large-scale
real-world corpus of New York Times news articles.
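
Illustration (not part of the report): the notion of a phrase being "frequent in the subset but relatively infrequent in the overall corpus" can be sketched in a few lines of Python. The score used here, the ratio of a phrase's document frequency in the subset to its document frequency in the whole corpus, is only one plausible reading of "interestingness" and is an assumption, as are all function and parameter names (`top_k_interesting`, `min_support`, `max_len`). The sketch is a brute-force scan and does not reflect the indexing or search techniques developed in the report.

```python
from collections import Counter
import heapq


def ngrams(tokens, max_len):
    """Yield every word n-gram of length 2..max_len as a tuple."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def doc_frequencies(docs, max_len):
    """Map each phrase to the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # A set ensures each phrase is counted once per document.
        df.update(set(ngrams(doc.lower().split(), max_len)))
    return df


def top_k_interesting(corpus, subset_ids, k, max_len=3, min_support=1):
    """Return the k phrases with the highest subset/corpus frequency ratio.

    Assumed score: (fraction of subset documents containing the phrase)
    divided by (fraction of all documents containing the phrase).
    Phrases occurring in fewer than min_support subset documents are skipped.
    """
    subset = [corpus[i] for i in subset_ids]
    df_corpus = doc_frequencies(corpus, max_len)
    df_subset = doc_frequencies(subset, max_len)
    scores = {
        " ".join(phrase): (count / len(subset))
        / (df_corpus[phrase] / len(corpus))
        for phrase, count in df_subset.items()
        if count >= min_support
    }
    # Ties are broken by the phrase string so the result is deterministic.
    return heapq.nlargest(k, scores.items(), key=lambda kv: (kv[1], kv[0]))
```

In this sketch, `subset_ids` stands in for the ad-hoc subset, for example the documents matching a keyword query or falling in a time period. A phrase that appears in every subset document but also everywhere else in the corpus scores 1.0, while a phrase concentrated in the subset scores higher.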

Attachment: mpi-i-2009-5-006.pdf (347 KBytes)

BibTeX:
  AUTHOR = {Bedathur, Srikanta and Berberich, Klaus and Dittrich, Jens and Mamoulis, Nikos and Weikum, Gerhard},
  TITLE = {Scalable phrase mining for ad-hoc text analytics},
  TYPE = {Research Report},
  INSTITUTION = {Max-Planck-Institut f{\"u}r Informatik},
  ADDRESS = {Stuhlsatzenhausweg 85, 66123 Saarbr{\"u}cken, Germany},
  NUMBER = {MPI-I-2009-5-006},
  MONTH = {April},
  YEAR = {2009},
  ISSN = {0946-011X},