Combining linguistic and statistical analysis to extract relations from web documents

Suchanek, Fabian; Ifrim, Georgiana; Weikum, Gerhard


Report


MPG Authors

Suchanek, Fabian
Databases and Information Systems, MPI for Informatics, Max Planck Society;

Ifrim, Georgiana
Databases and Information Systems, MPI for Informatics, Max Planck Society;

Weikum, Gerhard
Databases and Information Systems, MPI for Informatics, Max Planck Society;


Full texts (freely accessible)

MPI-I-2006-5-004.pdf
(any full text), 191 KB

Citation

Suchanek, F., Ifrim, G., & Weikum, G. (2006). Combining linguistic and statistical analysis to extract relations from web documents (MPI-I-2006-5-004). Saarbrücken: Max-Planck-Institut für Informatik.

Citation link: https://hdl.handle.net/11858/00-001M-0000-0014-6710-9

Abstract

Search engines, question-answering systems and classification systems alike can greatly profit from formalized world knowledge. Unfortunately, manually compiled collections of world knowledge (such as WordNet or the Suggested Upper Merged Ontology, SUMO) often suffer from low coverage, high assembly costs and fast aging. In contrast, the World Wide Web provides an endless source of knowledge, assembled by millions of people, updated constantly and available for free. In this paper, we propose a novel method for learning arbitrary binary relations from natural-language Web documents, without human interaction. Our system, LEILA, combines linguistic analysis and machine learning techniques to find robust patterns in the text and to generalize them. For initialization, we only require a set of examples of the target relation and a set of counterexamples (e.g. from WordNet). The architecture consists of three stages: finding patterns in the corpus based on the given examples, assessing the patterns based on probabilistic confidence, and applying the generalized patterns to propose pairs for the target relation. We demonstrate the benefits and practical viability of our approach by extensive experiments, showing that LEILA achieves consistent improvements over existing comparable techniques (e.g. Snowball, TextToOnto).
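
The three stages named in the abstract can be illustrated with a toy pipeline. This is only a sketch under strong assumptions and not LEILA's implementation: the literal text between two entity mentions stands in for the linguistic analysis, a plain ratio of example matches to all matches stands in for the probabilistic confidence, and candidate entities are assumed to be single capitalized words. All function names, the threshold, and the sample corpus are hypothetical.

```python
import re
from collections import Counter

def find_patterns(sentences, examples, counterexamples):
    """Stage 1: record the text linking a known pair as a candidate pattern."""
    positive, negative = Counter(), Counter()
    for sentence in sentences:
        for pairs, counts in ((examples, positive), (counterexamples, negative)):
            for x, y in pairs:
                match = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
                if match:
                    counts[match.group(1)] += 1
    return positive, negative

def assess_patterns(positive, negative, threshold=0.5):
    """Stage 2: keep patterns whose share of example matches exceeds the threshold."""
    confident = {}
    for pattern, pos in positive.items():
        confidence = pos / (pos + negative[pattern])  # Counter yields 0 if absent
        if confidence > threshold:
            confident[pattern] = confidence
    return confident

def apply_patterns(sentences, patterns):
    """Stage 3: propose pairs wherever a kept pattern links two capitalized words."""
    proposed = set()
    for sentence in sentences:
        for pattern in patterns:
            regex = r"([A-Z]\w+)" + re.escape(pattern) + r"([A-Z]\w+)"
            proposed.update(re.findall(regex, sentence))
    return proposed

# Hypothetical toy corpus and seed pairs for a "capital of" relation.
corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Paris is larger than Lyon.",
    "Madrid is the capital of Spain.",
    "Rome is larger than Milan.",
]
examples = [("Paris", "France"), ("Berlin", "Germany")]
counterexamples = [("Paris", "Lyon")]

positive, negative = find_patterns(corpus, examples, counterexamples)
patterns = assess_patterns(positive, negative)
new_pairs = apply_patterns(corpus, patterns) - set(examples)
print(new_pairs)  # {('Madrid', 'Spain')}
```

In this toy run, only the pattern " is the capital of " is retained. The pattern " is larger than " is matched solely by the counterexample, so it never enters the positive counts and is not applied in the third stage.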