Improved Multilingual Temporal Tagging with HeidelTime

Ahmad, Faraz

Please note that a newer version of this item is available:
https://pure.mpg.de/pubman/item/item_3010465_4

MPS-Authors

Ahmad, Faraz
International Max Planck Research School, MPI for Informatics, Max Planck Society;

Citation

Ahmad, F. (2018). Improved Multilingual Temporal Tagging with HeidelTime. Master Thesis, Universität des Saarlandes, Saarbrücken.

Cite as: https://hdl.handle.net/21.11116/0000-0002-B37F-6

Abstract

One important subtask of information extraction is temporal tagging. Temporal
tagging is a two-step process that consists of extracting temporal expressions
and normalizing them to a standard ISO date format. The task matters because
temporal information can be used to build robust question answering systems,
enrich knowledge bases with temporal information, and return better, time-aware
search results, among other applications. One freely available multilingual and
domain-sensitive temporal tagger is HeidelTime. It is a rule-based tagger that
can tag documents in 13 languages using resources developed manually by
language experts; in addition, it can tag documents in over 200 languages using
automatically developed resources. It can also tag documents from various
domains, such as news or narrative-style documents.
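
The two steps named above, extraction and normalization, can be illustrated with a minimal Python sketch. It is not HeidelTime's rule format or implementation: the single regular-expression rule, the `MONTHS` table, and the `tag` function are hypothetical names chosen for illustration, and only English expressions of the form "12 March 2018" or "March 2018" are handled.

```python
import re

# Hypothetical month lexicon; a real tagger would load this from language resources.
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

# One toy extraction rule: "<day> <month name> <year>" or "<month name> <year>".
PATTERN = re.compile(
    r"\b(?:(?P<day>\d{1,2})\s+)?(?P<month>" + "|".join(MONTHS) + r")\s+(?P<year>\d{4})\b",
    re.IGNORECASE,
)


def tag(text):
    """Return (matched expression, normalized ISO 8601 value) pairs."""
    results = []
    for m in PATTERN.finditer(text):
        # Normalization step: map the surface form to YYYY-MM or YYYY-MM-DD.
        value = f"{m.group('year')}-{MONTHS[m.group('month').lower()]:02d}"
        if m.group("day"):
            value += f"-{int(m.group('day')):02d}"
        results.append((m.group(0), value))
    return results


if __name__ == "__main__":
    print(tag("The thesis was submitted on 12 March 2018 and revised in May 2018."))
```

A rule-based tagger combines many such patterns with per-language lexicons; the sketch only shows how one matched expression is turned into an ISO-style value.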

In this thesis, we extend the current HeidelTime multilingual model to create
better automatically developed resources for over 200 languages, so that
HeidelTime's baseline tagging performance for these languages can be improved.
We extend the model in three ways: 1) We improve the automatically developed
resources for morphologically rich languages such as Finnish and Estonian.
2) We improve the automatically developed resources for unsegmented languages
such as Chinese and Japanese. 3) We improve the automatically developed
resources for all languages by enriching the language-independent rules with
new language-dependent rules that are learned from frequently occurring
temporal patterns in the respective languages. Finally, we present the results
of several evaluations and experiments using available temporally annotated
corpora and Wikipedia dumps for various languages, and summarize our findings.
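
The idea of learning language-dependent rules from frequently occurring temporal patterns can be sketched roughly as follows. This is an illustrative assumption, not the method used in the thesis: the functions `abstract_token` and `frequent_patterns`, the placeholder classes, and the whitespace tokenization are all hypothetical. The sketch replaces month names and numbers with placeholders and counts the token n-grams that contain a month.

```python
import re
from collections import Counter


def abstract_token(token, month_names):
    """Map a token to a placeholder class, or return it lowercased."""
    bare = token.strip(".,").lower()
    if bare in month_names:
        return "<MONTH>"
    if re.fullmatch(r"\d{4}", bare):
        return "<YEAR>"
    if re.fullmatch(r"\d{1,2}", bare):
        return "<NUM>"
    return bare


def frequent_patterns(sentences, month_names, n=3, min_count=2):
    """Count abstracted n-grams that contain a month placeholder."""
    counts = Counter()
    for sentence in sentences:
        # Whitespace tokenization is an assumption; unsegmented languages
        # would need a segmenter before this step.
        tokens = [abstract_token(t, month_names) for t in sentence.split()]
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if "<MONTH>" in gram:
                counts[gram] += 1
    return [(g, c) for g, c in counts.most_common() if c >= min_count]


if __name__ == "__main__":
    corpus = ["Am 3. Mai 2018 kam er.", "Am 7. Juni 1999 ging sie."]
    for pattern, count in frequent_patterns(corpus, {"mai", "juni"}):
        print(count, " ".join(pattern))
```

Patterns that recur often enough, such as a number followed by a month and a year, are candidates for new extraction rules in that language.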