ComQA: A Community-sourced Dataset for Complex Factoid Question Answering with Paraphrase Clusters

Abujabal, Abdalghani; Saha Roy, Rishiraj; Yahya, Mohamed; Weikum, Gerhard

MPS-Authors

Abujabal, Abdalghani
Databases and Information Systems, MPI for Informatics, Max Planck Society

Saha Roy, Rishiraj
Databases and Information Systems, MPI for Informatics, Max Planck Society

Weikum, Gerhard
Databases and Information Systems, MPI for Informatics, Max Planck Society

Fulltext (public)

arXiv:1809.09528.pdf (Preprint), 598KB

Citation

Abujabal, A., Saha Roy, R., Yahya, M., & Weikum, G. (2018). ComQA: A Community-sourced Dataset for Complex Factoid Question Answering with Paraphrase Clusters. Retrieved from http://arxiv.org/abs/1809.09528.

Cite as: https://hdl.handle.net/21.11116/0000-0002-A0FE-B

Abstract

To bridge the gap between the capabilities of the state-of-the-art in factoid
question answering (QA) and what real users ask, we need large datasets of real
user questions that capture the various question phenomena users are interested
in, and the diverse ways in which these questions are formulated. We introduce
ComQA, a large dataset of real user questions that exhibit different
challenging aspects such as temporal reasoning, compositionality, etc. ComQA
questions come from the WikiAnswers community QA platform. Through a large
crowdsourcing effort, we clean the question dataset, group questions into
paraphrase clusters, and annotate clusters with their answers. ComQA contains
11,214 questions grouped into 4,834 paraphrase clusters. We detail the process
of constructing ComQA, including the measures taken to ensure its high quality
while making effective use of crowdsourcing. We also present an extensive
analysis of the dataset and the results achieved by state-of-the-art systems on
ComQA, demonstrating that our dataset can be a driver of future research on QA.