Learning semantic sentence representations from visually grounded language without lexical knowledge

Merkx, Danny; Frank, Stefan L.

doi:10.1017/S1351324919000196

Merkx, D., & Frank, S. L. (2019). Learning semantic sentence representations from visually grounded language without lexical knowledge. Natural Language Engineering, 25, 451-466. doi:10.1017/S1351324919000196.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/21.11116/0000-000B-5E7A-4
Version Permalink: https://hdl.handle.net/21.11116/0000-000B-5E7B-3

Genre: Journal Article

Files

learning-semantic-sentence-representations-from-visually-grounded-language-without-lexical-knowledge.pdf (Publisher version), 674KB

File Permalink:
https://hdl.handle.net/21.11116/0000-000B-5E7C-2

Name:
learning-semantic-sentence-representations-from-visually-grounded-language-without-lexical-knowledge.pdf

Description:
-

OA-Status:
Not specified

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Copyright Date:
2019

Copyright Info:
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.

License:
http://creativecommons.org/licenses/by/4.0/

Creators

Creators:
Merkx, Danny (1, 2), Author
Frank, Stefan L., Author

Affiliations:
(1) Center for Language Studies, External Organizations, ou_55238
(2) International Max Planck Research School for Language Sciences, MPI for Psycholinguistics, Max Planck Society, Nijmegen, NL, ou_1119545

Content

Free keywords: -

Abstract: Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves results comparable to the current state of the art on two popular image-caption retrieval benchmark datasets: Microsoft Common Objects in Context (MSCOCO) and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings using the data from the Semantic Textual Similarity (STS) benchmark task and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence level semantics. Importantly, this result shows that we do not need prior knowledge of lexical level semantics in order to model sentence level semantics. These findings demonstrate the importance of visual information in semantics.
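
The retrieval setup described in the abstract (images and captions mapped to a common embedding space, so that each can be used to retrieve the other) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings below are made-up toy vectors, cosine similarity is assumed as the similarity measure, recall@k is assumed as the evaluation score, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose matching candidate (same index) is in the top k."""
    hits = sum(1 for i, q in enumerate(queries) if i in retrieve(q, candidates)[:k])
    return hits / len(queries)


# Toy, made-up embeddings (assumption): entry i of each list is a matching
# image-caption pair that has already been mapped to the common space.
image_embeddings = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
caption_embeddings = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.7, 0.1, 0.6]]

# Retrieval works in both directions: caption -> image and image -> caption.
caption_to_image = recall_at_k(caption_embeddings, image_embeddings, k=1)
image_to_caption = recall_at_k(image_embeddings, caption_embeddings, k=1)
print(f"caption-to-image recall@1: {caption_to_image:.2f}")
print(f"image-to-caption recall@1: {image_to_caption:.2f}")
```

In this toy data the third caption lies closer to the first image than to its own, so caption-to-image recall@1 falls below image-to-caption recall@1; the two retrieval directions are scored separately for that reason.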

Details

Language(s): eng - English

Dates: Published Online: 2019-07-31; Date issued: 2019

Publication Status: Issued

Pages: -

Publishing info: -

Table of Contents: -

Rev. Type: Peer

Identifiers: DOI: 10.1017/S1351324919000196

Degree: -

Source 1

Title: Natural Language Engineering

Source Genre: Journal

Creator(s):

Affiliations:

Publ. Info: Cambridge, UK : Cambridge University Press

Pages: -

Volume / Issue: 25

Sequence Number: -

Start / End Page: 451 - 466

Identifier: ISSN: 1351-3249
CoNE: https://pure.mpg.de/cone/journals/resource/954925342769