Abstract:
In the past few years, Visual Question Answering (VQA) has seen immense progress
in both accuracy and network architectures. From simple end-to-end neural
architectures to complex modular architectures that incorporate interpretability
and explainability, VQA has been a very dynamic area of research. Recent work has
shown that, despite significant progress, VQA models are notoriously brittle to
linguistic variations in the questions: a small rephrasing of the question can
lead a VQA model to change its answer. However, variations in the images,
introduced by editing them in a semantic fashion, have not (to the best of our
knowledge) been studied before. In this thesis, we explore how consistent these
models are when we manipulate the images semantically by removing objects that
are irrelevant to answering the question. Ideally, under this manipulation, the
model should not change its answer. We construct consistency metrics based on
how often models flip their answer. Our findings reveal that a compositional
model, though slightly less accurate than an attention model, is more robust to
such manipulations. We also show that fine-tuning the model on the generated
edited samples in a strategic manner can make it more consistent and robust.
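The flip-based consistency metric mentioned above can be sketched as a simple flip rate. The snippet below is an illustrative assumption (function and variable names are not from the thesis): given a model's answers on the original images and on their semantically edited counterparts, consistency is the fraction of pairs whose answer did not change.

```python
# Illustrative sketch of a flip-based consistency metric (names are
# hypothetical, not the exact metric from the thesis).

def consistency(original_answers, edited_answers):
    """Fraction of (original, edited) answer pairs that did NOT flip."""
    assert len(original_answers) == len(edited_answers)
    flips = sum(o != e for o, e in zip(original_answers, edited_answers))
    return 1.0 - flips / len(original_answers)

# Example: the model flips on one of four edited images.
score = consistency(["dog", "2", "red", "yes"],
                    ["dog", "2", "blue", "yes"])
print(score)  # -> 0.75
```

A model that never changes its answer under irrelevant-object removal would score 1.0 under this metric.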
In the next phase, we target the task of counting in particular, where we aim
to teach the model to count by modulating the frequency of an object. We use
the same method to generate the dataset, but this time we remove the object
being counted in the question, one instance at a time; hence we expect the
answer to change. We evaluate the most robust model's predictions on this set
and observe a significant drop in accuracy. We show that fine-tuning the model
on the edited counting set significantly improves its performance when
evaluated on that set. In addition, the edited set marginally improves the
model's accuracy on the original set.