Conference Paper

DiscoGeM: A crowdsourced corpus of genre-mixed implicit discourse relations

Fulltext (public)

Scholman_etal_2022_LREC 2022.pdf
(Publisher version), 576KB

Citation

Scholman, M., Dong, T., Yung, F., & Demberg, V. (2022). DiscoGeM: A crowdsourced corpus of genre-mixed implicit discourse relations. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, et al. (Eds.), Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022) (pp. 3281-3290). Marseille, France: European Language Resources Association.


Cite as: https://hdl.handle.net/21.11116/0000-000C-7AE3-B
Abstract
We present DiscoGeM, a crowdsourced corpus of 6,505 implicit discourse relations from three genres: political speech,
literature, and encyclopedic texts. Each instance was annotated by 10 crowd workers. Various label aggregation methods
were explored to evaluate how to obtain a label that best captures the meaning inferred by the crowd annotators. The results
show that a significant proportion of discourse relations in DiscoGeM are ambiguous and can express multiple relation senses.
Probability distribution labels better capture these interpretations than single labels. Further, the results emphasize that text
genre crucially affects the distribution of discourse relations, suggesting that genre should be included as a factor in automatic
relation classification. We make available the newly created DiscoGeM corpus, as well as the dataset with all annotator-level
labels. Both the corpus and the dataset can support a range of applications and research purposes, for example serving as
training data to improve the performance of automatic discourse relation parsers and facilitating research into
non-connective signals of discourse relations.
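
The contrast the abstract draws between a single aggregated label and a probability distribution label can be illustrated with a minimal sketch. It assumes the simplest possible aggregation (a majority vote and normalised vote counts); the abstract does not specify the aggregation methods that were actually compared, and the function name `aggregate` and the sense labels used here are hypothetical, not the corpus's label set.

```python
from collections import Counter


def aggregate(labels):
    """Aggregate one instance's crowd labels (hypothetical helper).

    Returns a (majority_label, distribution) pair, where distribution
    maps each observed sense to its share of the votes.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    # Probability distribution label: relative frequency of each sense.
    distribution = {sense: n / total for sense, n in counts.items()}
    # Single label: the most frequent sense.
    majority_label = counts.most_common(1)[0][0]
    return majority_label, distribution


# Illustrative votes from 10 crowd workers for one relation (made-up data).
votes = ["conjunction"] * 5 + ["result"] * 4 + ["contrast"]
majority, dist = aggregate(votes)
print(majority)
print(dist)
```

In this made-up case the single label keeps only "conjunction", while the distribution retains the fact that four of the ten workers inferred "result", which is the kind of ambiguity the abstract says single labels fail to capture. Tied votes are not handled specially in this sketch.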