Visual Question Answering Literature Survey

List of papers

  • [01VQA] VQA: Visual Question Answering
  • [02EMD] Exploring Models and Data for Image Question Answering
  • [03LAQ] Learning to Answer Questions From Image Using Convolutional Neural Network
  • [04DCQ] Deep Compositional Question Answering with Neural Module Networks
  • [05ABC] An attention based convolutional neural network for visual question answering
  • [06ATM] Are you talking to a machine? Dataset and methods for multilingual image question answering
  • [07DPP] Image question answering using convolutional neural network with dynamic parameter prediction
  • [08WTL] Where to look: Focus regions for visual question answering
  • [09AMA] Ask me anything: Free-form visual question answering based on knowledge from external sources
  • [10V7W] Visual7W: Grounded Question Answering in Images
  • [11AAA] Ask, Attend and Answer: Exploring question-guided spatial attention for visual question answering
  • [12SAN] Stacked attention networks for image question answering
  • [13BOW] Simple Baseline for Visual Question Answering
  • [14ICV] Image Captioning & Visual Question Answering Based on Attributes & External Knowledge
  • [15DMN] Dynamic Memory Networks for Visual and Textual Question Answering
  • [16CMV] Compositional Memory for Visual Question Answering
  • [17LCN] Learning to Compose Neural Networks for Question Answering

Results

VQA Dataset

(as self-published by the authors; not verified)

Results below are for the 2015 test-dev split, except the final column, which is for test-standard.

Method         All    Y/N    Other   Num    Test-Std [All]
~~~~~~~~~~~~   ~~~~   ~~~~   ~~~~~   ~~~~   ~~~~~~~~~~~~~~
Image          28.1   64.0    3.8     0.4   -
Question       48.1   75.7   27.1    36.7   -
Q+I            52.6   75.6   37.4    33.7   -
LSTM Q+I       53.7   78.9   36.4    35.2   54.1
[16CMV]        52.6   78.3   35.9    34.4   -
[09AMA]        55.7   79.2   40.1    36.1   56.0
[13BOW]        55.7   76.5   42.6    35.0   55.9
[07DPP]        57.2   80.7   41.7    37.2   57.4
[17LCN]        57.9   80.5   43.1    37.4   58.0
[11AAA]        57.9   80.8   43.2    37.3   58.2
[12SAN]        58.7   79.3   46.1    36.6   58.9
[15DMN]        60.3   80.5   48.3    36.8   60.4

WordCloud

  • To get a quick sense of where most of the models are headed architecturally, it suffices to look at the word cloud (generated after removing stop-words); a generation sketch follows.
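A minimal sketch of how such a cloud can be generated with the Python `wordcloud` package (the package choice and the idea of feeding it one text file per paper are assumptions; the survey does not say how its cloud was built):

    # Sketch: build a word cloud from the papers' text after stop-word removal.
    # The `wordcloud` package (pip install wordcloud) ships a stop-word list.
    from pathlib import Path
    from wordcloud import WordCloud, STOPWORDS

    # Hypothetical layout: one plain-text file per paper in ./papers/
    corpus = " ".join(p.read_text() for p in Path("papers").glob("*.txt"))

    cloud = WordCloud(width=800, height=400,
                      stopwords=STOPWORDS,        # drop common English stop-words
                      background_color="white").generate(corpus)
    cloud.to_file("vqa_wordcloud.png")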

Image Features

GoogLeNet vs VGGNet

  • GoogLeNet
    • [16CMV]
    • [13BOW]
    • [14ICV]
    • [06ATM]
    • [11AAA]
    • [12SAN]
  • VGGNet
    • [12SAN]
    • [06ATM]
    • [01VQA]
    • [14ICV]
    • [04DCQ]
    • [09AMA]
  • Some overlap exists because those papers experimented with both kinds of features.
  • Some of them also experimented with AlexNet, e.g. [06ATM], [16CMV], and [09AMA], but in every case it underperformed the other feature extractors.
  • No one is using ResNet yet. A sketch of the frozen-CNN feature extraction these papers share follows this list.
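As an illustration, here is a minimal PyTorch/torchvision sketch of that extraction (torchvision and the fc7 choice are assumptions for illustration; the surveyed papers used Caffe/Torch-era GoogLeNet and VGGNet models):

    # Sketch: extract fixed fc7 features from a pretrained VGG-16; the CNN is
    # frozen and the features are typically cached, as in most surveyed models.
    import torch
    from torchvision import models, transforms
    from PIL import Image

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
    # Drop the final classifier layer to expose the 4096-d fc7 activations.
    feature_extractor = torch.nn.Sequential(
        vgg.features, vgg.avgpool, torch.nn.Flatten(),
        *list(vgg.classifier.children())[:-1],
    )

    with torch.no_grad():
        image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
        fc7 = feature_extractor(image)   # shape: (1, 4096)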

Memory

LSTM vs GRU

  • Only [15DMN] and [07DPP] use GRUs, whereas the others use LSTMs.
  • However, it should be noted that [15DMN] uses a bi-directional GRU, whereas [07DPP] uses a unidirectional GRU; a minimal sketch of the difference follows.
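A minimal PyTorch sketch of that difference (PyTorch and the dimensions are purely illustrative; neither paper used this code):

    # Sketch: unidirectional vs bidirectional GRU question encoders.
    import torch
    import torch.nn as nn

    embed_dim, hidden_dim = 300, 512
    questions = torch.randn(8, 20, embed_dim)   # (batch, seq_len, embed_dim)

    uni_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
    bi_gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    uni_out, _ = uni_gru(questions)   # (8, 20, 512): forward states only
    bi_out, _ = bi_gru(questions)     # (8, 20, 1024): forward + backward states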

No bidirectional LSTM

  • While several papers report [02EMD]’s numbers for its bi-directional LSTM (the 2-VIS+BLSTM model), none of them uses a bi-directional LSTM in its main model.
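For reference, a bi-directional LSTM question encoder would look roughly like the hedged sketch below (this is not [02EMD]’s 2-VIS+BLSTM model, which also injects image features into the sequence; sizes are illustrative):

    # Sketch: a bi-directional LSTM question encoder (hypothetical).
    import torch
    import torch.nn as nn

    embed_dim, hidden_dim = 300, 512
    questions = torch.randn(8, 20, embed_dim)   # (batch, seq_len, embed_dim)

    blstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
    outputs, (h_n, c_n) = blstm(questions)

    # Concatenate the final forward and backward hidden states into a single
    # question vector of size 2 * hidden_dim.
    question_vec = torch.cat([h_n[0], h_n[1]], dim=-1)   # (8, 1024)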

Attention models vs non-attention models

  • The following papers use attention models (a minimal question-guided attention sketch follows this section's lists):
    • [15DMN] - Attention gates using modified GRUs
    • [04DCQ]
    • [11AAA] - Spatial attention with a two-hop model
    • [05ABC]
    • [12SAN]
    • [10V7W]
  • The following papers do not use any attention model, but believe that adding attention would improve their performance:
    • [07DPP]
    • [02EMD]
    • [06ATM]
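Most of these attention mechanisms follow the same pattern: score each spatial location of the CNN feature map against the question representation, softmax the scores, and take a weighted sum of the image features. Below is a minimal single-hop sketch in the spirit of [12SAN] (layer sizes and the scoring network are assumptions, not the paper's exact configuration):

    # Sketch: one hop of question-guided spatial attention, SAN-style.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionHop(nn.Module):
        def __init__(self, img_dim=512, q_dim=512, att_dim=256):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, att_dim)
            self.q_proj = nn.Linear(q_dim, att_dim)
            self.score = nn.Linear(att_dim, 1)

        def forward(self, img_feats, q_vec):
            # img_feats: (batch, regions, img_dim), e.g. 14*14 = 196 regions
            # q_vec:     (batch, q_dim)
            joint = torch.tanh(self.img_proj(img_feats)
                               + self.q_proj(q_vec).unsqueeze(1))
            weights = F.softmax(self.score(joint).squeeze(-1), dim=1)
            attended = (weights.unsqueeze(-1) * img_feats).sum(dim=1)
            # [12SAN] refines the query by adding the attended image vector
            # to it (requires img_dim == q_dim, as in the defaults here).
            return attended + q_vec

    hop = AttentionHop()
    v = torch.randn(4, 196, 512)   # conv features over 196 spatial regions
    q = torch.randn(4, 512)        # question vector
    u = hop(v, q)                  # (4, 512): input to a second hop or a classifier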

Episodic memory vs Semantic memory

  • [16CMV] claims to have used “declarative memory”, which is a combination of semantic and episodic memory.
  • [15DMN] uses an episodic memory module explicitly; a sketch of its attention-gated update follows the reference below.
  • For reference: Tulving, Endel. “Episodic and semantic memory.” Organization of Memory. London: Academic Press, 1972.
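Concretely, the episodic memory of [15DMN] replaces the GRU update gate with a per-fact attention gate, so facts judged irrelevant to the question leave the memory state unchanged: h_t = g_t * GRU(x_t, h_{t-1}) + (1 - g_t) * h_{t-1}. A simplified sketch of that gated step (the gate network below is an illustrative assumption, not the paper's exact scoring function):

    # Sketch: DMN-style attention-gated GRU step, simplified.
    import torch
    import torch.nn as nn

    hidden_dim = 512
    gru_cell = nn.GRUCell(hidden_dim, hidden_dim)
    gate_net = nn.Sequential(nn.Linear(3 * hidden_dim, hidden_dim), nn.Tanh(),
                             nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def episodic_step(fact, h_prev, question):
        # fact, h_prev, question: (batch, hidden_dim)
        g = gate_net(torch.cat([fact, h_prev, question], dim=-1))   # (batch, 1)
        h_candidate = gru_cell(fact, h_prev)
        # Gated update: irrelevant facts (g close to 0) keep the old state.
        return g * h_candidate + (1 - g) * h_prev

    facts = torch.randn(5, 4, hidden_dim)   # 5 facts, batch of 4
    question = torch.randn(4, hidden_dim)
    h = torch.zeros(4, hidden_dim)
    for t in range(facts.size(0)):
        h = episodic_step(facts[t], h, question)   # final h summarizes one episode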

Activation functions used

  • Sigmoid
    • [15DMN] - Also uses ReLU & TanH
    • [06ATM] - Also uses TanH
  • ReLU
    • [11AAA]
    • [03LAQ]
    • [04DCQ]
    • [08WTL]
  • TanH
    • [12SAN]
    • [14ICV]
    • [10V7W]
    • [01VQA]
    • [05ABC]
    • [16CMV]
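For reference, the three non-linearities themselves (plain NumPy definitions; nothing here is specific to any surveyed model):

    # Sketch: the activation functions referenced above.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))    # squashes to (0, 1)

    def tanh(x):
        return np.tanh(x)                  # squashes to (-1, 1)

    def relu(x):
        return np.maximum(0.0, x)          # zero for negatives, identity otherwise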

Use of Batch Normalization

  • Only [07DPP] and [08WTL] have used batch normalization in their training, even though it is used extensively across the rest of deep learning research; a placement sketch follows the reference below.
  • For reference: Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015).
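A minimal sketch of what adding batch normalization to a VQA answer classifier looks like in PyTorch (placement, sizes, and the PyTorch choice are all illustrative assumptions; [07DPP] and [08WTL] used their own frameworks):

    # Sketch: batch normalization between the layers of an answer classifier.
    import torch
    import torch.nn as nn

    fused_dim, num_answers = 1024, 1000    # assumed fused feature size / answer vocab

    classifier = nn.Sequential(
        nn.Linear(fused_dim, 1024),
        nn.BatchNorm1d(1024),              # normalize activations per mini-batch
        nn.ReLU(),
        nn.Linear(1024, num_answers),
    )

    fused = torch.randn(32, fused_dim)     # batch of fused image+question features
    logits = classifier(fused)             # (32, num_answers)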