Abstract:
In the past few years, Visual Question Answering (VQA) has seen immense progress
in both accuracy and network architectures. From simple end-to-end neural
architectures to complex modular architectures that incorporate interpretability
and explainability, VQA has been a very dynamic area of research. Recent work has
shown that, despite significant progress, VQA models are notoriously brittle to
linguistic variations in the questions: a small rephrasing of the question can
lead a VQA model to change its answer. However, variations in the images,
introduced by editing them in a semantic fashion, have not (to the best of our
knowledge) been studied before. In this thesis, we explore how consistent these
models are when we manipulate the images semantically by removing objects that
are irrelevant to answering the question. Ideally, under this manipulation, the
model should not change its answer. We construct consistency metrics based on
how often models flip their answer. Our findings reveal that a compositional
model, though slightly less accurate than an attention model, is more robust to
such manipulations. We also show that fine-tuning the model on the generated
edited samples in a strategic manner can make it more consistent and robust.
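The flip-based consistency metric mentioned above can be sketched as a simple flip rate. The snippet below is an illustrative assumption (function and variable names are not from the thesis): given a model's answers on the original images and on their semantically edited counterparts, consistency is the fraction of pairs whose answer did not change.

```python
# Illustrative sketch of a flip-based consistency metric (names are
# hypothetical, not the exact metric from the thesis).

def consistency(original_answers, edited_answers):
    """Fraction of (original, edited) answer pairs that did NOT flip."""
    assert len(original_answers) == len(edited_answers)
    flips = sum(o != e for o, e in zip(original_answers, edited_answers))
    return 1.0 - flips / len(original_answers)

# Example: the model flips on one of four edited images.
score = consistency(["dog", "2", "red", "yes"],
                    ["dog", "2", "blue", "yes"])
print(score)  # -> 0.75
```

A model that never changes its answer under irrelevant-object removal would score 1.0 under this metric.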
In the next phase, we target the task of counting in particular, where we aim
to teach the model to count by modulating the frequency of an object. We use
the same method to generate the dataset, but this time we remove the object
being counted in the question, one instance at a time; hence we expect the
answer to change. We evaluate the most robust model's predictions on this set
and observe a significant drop in accuracy. We show that fine-tuning the model
on the edited counting set significantly improves its performance when
evaluated on that set. In addition, the edited set marginally improves the
model's accuracy on the original set.