非表示:
キーワード:
Computer Science, Information Retrieval, cs.IR
要旨:
Preference judgments have been demonstrated as a better alternative to graded
judgments to assess the relevance of documents relative to queries. Existing
work has verified transitivity among preference judgments when collected from
trained judges, which reduced the number of judgments dramatically. Moreover,
strict preference judgments and weak preference judgments, where the latter
additionally allow judges to state that two documents are equally relevant for
a given query, are both widely used in literature. However, whether
transitivity still holds when collected from crowdsourcing, i.e., whether the
two kinds of preference judgments behave similarly remains unclear. In this
work, we collect judgments from multiple judges using a crowdsourcing platform
and aggregate them to compare the two kinds of preference judgments in terms of
transitivity, time consumption, and quality. That is, we look into whether
aggregated judgments are transitive, how long it takes judges to make them, and
whether judges agree with each other and with judgments from TREC. Our key
findings are that only strict preference judgments are transitive. Meanwhile,
weak preference judgments behave differently in terms of transitivity, time
consumption, as well as of quality of judgment.