Word embeddings for practical information retrieval

Galke, Lukas; Saleh, Ahmed; Scherp, Ansgar

doi:10.18420/in2017_215

Galke, L., Saleh, A., & Scherp, A. (2017). Word embeddings for practical information retrieval. In M. Eibl, & M. Gaedke (Eds.), INFORMATIK 2017 (pp. 2155-2167). Bonn: Gesellschaft für Informatik. doi:10.18420/in2017_215.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/21.11116/0000-0009-F9B3-4 Version Permalink: https://hdl.handle.net/21.11116/0000-0009-F9B4-3

Genre: Conference Paper

Files

Galke_etal_2017_Evaluating the impact of....pdf (Publisher version), 385KB

File Permalink:
https://hdl.handle.net/21.11116/0000-0009-F9B5-2

Name:
Galke_etal_2017_Evaluating the impact of....pdf

Description:
-

OA-Status:

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Copyright Date:
-

Copyright Info:
-

License:
-

Creators

Creators:
Galke, Lukas¹, Author
Saleh, Ahmed, Author
Scherp, Ansgar, Author

Affiliations:
¹ Kiel University, Kiel, Germany

Content

Free keywords: -

Abstract: We assess the suitability of word embeddings for practical information retrieval scenarios. Thus, we assume that users issue ad-hoc short queries where we return the first twenty retrieved documents after applying a boolean matching operation between the query and the documents. We compare the performance of several techniques that leverage word embeddings in the retrieval models to compute the similarity between the query and the documents, namely word centroid similarity, paragraph vectors, Word Mover’s distance, as well as our novel inverse document frequency (IDF) re-weighted word centroid similarity. We evaluate the performance using the ranking metrics mean average precision, mean reciprocal rank, and normalized discounted cumulative gain. Additionally, we inspect the retrieval models’ sensitivity to document length by using either only the title or the full text of the documents for the retrieval task. We conclude that word centroid similarity is the best competitor to state-of-the-art retrieval models. It can be further improved by re-weighting the word frequencies with IDF before aggregating the respective word vectors of the embedding. The proposed cosine similarity of IDF re-weighted word vectors is competitive with the TF-IDF baseline and even outperforms it in the news domain by a relative margin of 15%.
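The pipeline described in the abstract (boolean matching, then ranking by cosine similarity of IDF re-weighted word centroids, returning the first twenty documents) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the toy two-dimensional embedding, the smoothed IDF formula, the "at least one shared term" matching rule, and the choice to weight the query centroid with IDF as well are all assumptions made for illustration.

```python
import math
from collections import Counter

import numpy as np


def idf_weights(docs):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1 (an assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def centroid(tokens, emb, idf):
    """IDF-weighted sum of word vectors, L2-normalised; None if no token is known."""
    counts = Counter(t for t in tokens if t in emb)
    if not counts:
        return None
    # Term frequency times IDF scales each word vector before aggregation.
    vec = sum(tf * idf.get(t, 1.0) * emb[t] for t, tf in counts.items())
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else None


def retrieve(query, docs, emb, k=20):
    """Return up to k (doc_index, cosine) pairs, best first."""
    idf = idf_weights(docs)
    q = centroid(query, emb, idf)
    if q is None:
        return []
    scored = []
    for i, doc in enumerate(docs):
        # Boolean matching (assumed here: at least one shared term).
        if not set(query) & set(doc):
            continue
        d = centroid(doc, emb, idf)
        if d is not None:
            # Both centroids are unit length, so the dot product is the cosine.
            scored.append((float(q @ d), i))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, s) for s, i in scored[:k]]


# Toy embedding and corpus, invented purely for this example.
emb = {
    "stock": np.array([1.0, 0.0]),
    "market": np.array([0.9, 0.1]),
    "football": np.array([0.0, 1.0]),
    "match": np.array([0.1, 0.9]),
    "the": np.array([0.5, 0.5]),
}
docs = [
    ["the", "stock", "market"],
    ["the", "football", "match"],
    ["stock", "football"],
]
results = retrieve(["stock"], docs, emb)
print(results)
```

In this toy run, document 1 is dropped by the matching step because it shares no term with the query, and document 0 ranks above document 2 because its centroid points more closely in the direction of the query vector.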

Details

Language(s): eng - English

Dates: Published Online: 2017

Publication Status: Published online

Pages: -

Publishing info: -

Table of Contents: -

Rev. Type: Peer

Identifiers: DOI: 10.18420/in2017_215

Degree: -

Event

Title: Informatik 2017

Place of Event: Chemnitz, Germany

Start-/End Date: 2017-09-25 - 2017-09-29

Source 1

Title: INFORMATIK 2017

Source Genre: Proceedings

Creator(s):
Eibl, M., Editor
Gaedke, M., Editor

Affiliations:
-

Publ. Info: Bonn : Gesellschaft für Informatik

Pages: -

Volume / Issue: -

Sequence Number: -

Start / End Page: 2155 - 2167

Identifier: -