  Adversarial Scene Editing for Visual Question Answering

Agarwal, V. (2019). Adversarial Scene Editing for Visual Question Answering. Master Thesis, Universität des Saarlandes, Saarbrücken.


Files

2019 MSc Thesis Vedika Agarwal.pdf (Any fulltext), 63 MB
 
Name: 2019 MSc Thesis Vedika Agarwal.pdf
Visibility: Restricted (Max Planck Institute for Informatics, MSIN)
MIME-Type / Checksum: application/pdf


Creators

 Creators:
Agarwal, Vedika (1), Author
Fritz, Mario (2), Advisor
Schiele, Bernt (1), Referee
Shetty, Rakshith (1), Advisor
Fritz, Mario (2), Referee
Affiliations:
1: Computer Vision and Machine Learning, MPI for Informatics, Max Planck Society, ou_1116547
2: CISPA Helmholtz Center for Information Security, Stuhlsatzenhausweg 5, 66123 Saarbrücken, DE, ou_persistent22

Content

 Abstract: In the past few years, Visual Question Answering (VQA) has seen immense
progress, both in accuracy and in network architectures. From simple end-to-end
neural network-based architectures to complex modular architectures that incorporate
interpretability and explainability, VQA has been a very dynamic area of research.
Recent work has shown that, despite significant progress, VQA models are notoriously
brittle to linguistic variations in the questions: a small rephrasing of the question
can lead a VQA model to change its answer. However, to the best of our knowledge,
variations in the images, obtained by editing them in a semantic fashion, have not
been studied before. In this thesis, we explore how consistent these models are when
we manipulate the images semantically, removing objects that are irrelevant to
answering the question. Ideally, the model should not change its answer under this
manipulation. We construct consistency metrics based on how often models flip their
answer. Our findings reveal that a compositional model, though slightly less accurate
than an attention model, is more robust to such manipulations. We also show that
fine-tuning the model on the generated edited samples in a strategic manner can help
make it more consistent and robust.
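
As a minimal sketch of such a flip-based consistency metric (the function name and
data layout here are illustrative assumptions, not taken from the thesis), one could
compute the fraction of answers that survive an edit which should be inconsequential:

    # Sketch: consistency under semantic edits that should not change the answer.
    # Each pair holds the model's answer on the original image and on the edited
    # image, where an object irrelevant to the question was removed.
    def consistency(pairs):
        if not pairs:
            return 0.0
        kept = sum(1 for original, edited in pairs if original == edited)
        return kept / len(pairs)  # 1.0 means the model never flips its answer

    # Example: 3 of 4 answers survive the edit, so consistency is 0.75.
    print(consistency([("cat", "cat"), ("2", "2"), ("red", "blue"), ("yes", "yes")]))

A flip rate is then simply 1 minus this value; either form captures how often the
model changes its answer under a manipulation that should not affect it.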
In the next phase, we target the task of counting in particular, where we hope to
teach the model to count by modulating the frequency of an object. We use the same
method to generate the dataset, but this time we remove the object being counted in
the question, one instance at a time; hence we expect the answer to change. We
evaluate the most robust model's predictions on this set and observe a significant
drop in accuracy. We show that fine-tuning the model on the edited counting set
significantly improves its performance on that set. In addition, the edited set
marginally improves the model's accuracy on the original set.
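
To make the expected behaviour on the edited counting set concrete, here is a
minimal sketch (names and inputs are my own assumptions, not from the thesis):
removing the counted object one instance at a time should decrement the ground-truth
answer by one, and predictions can be scored against that sequence:

    # Sketch: scoring on an edited counting set where the k-th edit removes k
    # instances of the counted object, so the ground truth after k removals
    # is original_count - k.
    def counting_accuracy(predictions, original_count):
        expected = [original_count - k for k in range(1, len(predictions) + 1)]
        correct = sum(int(p) == e for p, e in zip(predictions, expected))
        return correct / len(predictions) if predictions else 0.0

    # Example: 4 objects originally; model answers after removing 1, 2, 3 instances.
    print(counting_accuracy(["3", "2", "2"], original_count=4))  # 2/3 correct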

Details

Language(s): eng - English
 Dates: 2019-09-30
 Publication Status: Issued
 Pages: 92 p.
 Publishing info: Saarbrücken : Universität des Saarlandes
 Identifiers: BibTeX Citekey: Agarwal_Master2019
 Degree: Master
