List of papers
- [01VQA] VQA: Visual Question Answering
- [02EMD] Exploring Models and Data for Image Question Answering
- [03LAQ] Learning to Answer Questions From Image Using Convolutional Neural Network
- [04DCQ] Deep Compositional Question Answering with Neural Module Networks
- [05ABC] An attention based convolutional neural network for visual question answering
- [06ATM] Are you talking to a machine? Dataset and methods for multilingual image question answering
- [07DPP] Image question answering using convolutional neural network with dynamic parameter prediction
- [08WTL] Where to look: Focus regions for visual question answering
- [09AMA] Ask me anything: Free-form visual question answering based on knowledge from external sources
- [10V7W] Visual7W: Grounded Question Answering in Images
- [11AAA] Ask, Attend and Answer: Exploring question-guided spatial attention for visual question answering
- [12SAN] Stacked attention networks for image question answering
- [13BOW] Simple Baseline for Visual Question Answering
- [14ICV] Image Captioning & Visual Question Answering Based on Attributes & External Knowledge
- [15DMN] Dynamic Memory Networks for Visual and Textual Question Answering
- [16CMV] Compositional Memory for Visual Question Answering
- [17LCN] Learning to Compose Neural Networks for Question Answering
Results below are as self-published by the authors (not independently verified). They are for the 2015 test-dev split, except the final column, which is for test-standard.
- For a quick sense of where most of the models are headed architecturally, a word cloud of the papers (after removing stop-words) suffices
GoogLeNet vs VGGNet
- Some overlap exists because some papers experimented with both kinds of features
- A few papers, e.g. [06ATM], [16CMV], and [09AMA], also experimented with AlexNet, but in each case it was the underperforming model
- No one is using ResNet yet
LSTM vs GRU
- Only [15DMN] and [07DPP] are using GRU whereas others are using LSTM.
- However, it should be noted that [15DMN] uses a bi-directional GRU whereas [07DPP] uses a unidirectional GRU.
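For concreteness, here is a minimal numpy sketch of a GRU cell and a bi-directional encoder that concatenates the forward and backward final states. This is a simplified illustration, not any of these papers' exact formulations; the square weight shapes and the helper names (`gru_cell`, `bigru_encode`) are assumptions made for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step; all weights are (d, d) for simplicity (biases omitted)."""
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde           # interpolate old and new state

def encode(xs, params, reverse=False):
    """Run a GRU over a sequence; reverse=True gives the backward pass."""
    h = np.zeros(xs[0].shape)
    for x in (reversed(xs) if reverse else xs):
        h = gru_cell(x, h, *params)
    return h

def bigru_encode(xs, params):
    """Bi-directional encoding: concatenate forward and backward final states."""
    return np.concatenate([encode(xs, params), encode(xs, params, reverse=True)])

# Tiny usage example with random weights (d = 4, sequence length 3).
rng = np.random.default_rng(0)
params = tuple(0.1 * rng.standard_normal((4, 4)) for _ in range(6))
xs = [rng.standard_normal(4) for _ in range(3)]
h = bigru_encode(xs, params)   # shape (8,): forward state + backward state
```

The bi-directional variant doubles the encoding dimension but lets each state see both past and future tokens of the question.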
No bidirectional LSTM
- While several papers have reported [02EMD]'s bi-directional LSTM numbers (model 2-VIS+BLSTM), no one has used a bi-directional LSTM in their main model.
Attention models vs non-attention models
- The following papers use attention models
- [15DMN] - Attention gates using modified GRUs
- [11AAA] - Spatial attention with two hop model
- The following papers do not use any attention model, but state that adding attention would likely improve their model’s performance
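As a rough illustration of what one attention “hop” over image regions looks like: score each region against the question embedding, normalize the scores with a softmax, and take the weighted sum of region features. This is a generic sketch, not the exact scoring function of [11AAA], [12SAN], or [15DMN]; the bilinear scoring matrix `W` and the function name `spatial_attention` are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def spatial_attention(regions, question, W):
    """One attention hop over image regions, guided by the question.

    regions:  (num_regions, d) region feature vectors
    question: (d,) question embedding
    W:        (d, d) learned projection (hypothetical; papers differ here)
    Returns the attention weights and the attended image vector.
    """
    scores = regions @ W @ question   # relevance of each region to the question
    alpha = softmax(scores)           # normalize into a distribution over regions
    attended = alpha @ regions        # weighted sum of region features
    return alpha, attended

# Tiny usage example: 5 regions with 4-dimensional features.
rng = np.random.default_rng(0)
regions = rng.standard_normal((5, 4))
question = rng.standard_normal(4)
alpha, attended = spatial_attention(regions, question, np.eye(4))
```

A two-hop model such as [11AAA]'s repeats this step, feeding the attended vector back in to refine where the model looks.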
Episodic memory vs Semantic memory
- [16CMV] claims to have used “declarative memory”, which is a combination of semantic and episodic memory.
- [15DMN] uses episodic memory modules explicitly.
- For reference: Tulving, Endel. “Episodic and semantic memory 1.” Organization of Memory. London: Academic 381.4 (1972).
Activation functions used
- [15DMN] - Also uses ReLU and tanh
- [06ATM] - Also uses tanh
Use of Batch Normalization
- Only [07DPP] and [08WTL] have used batch normalization in training, even though it is used extensively in most other areas of deep learning research.
- For reference: Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015).
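The transform from the Ioffe & Szegedy paper is simple to state: normalize each feature by its batch mean and variance, then apply a learned scale and shift. A minimal numpy sketch (training-time statistics only; the running averages used at inference are omitted, and the function name is an assumption):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over the batch axis (Ioffe & Szegedy, 2015).

    x:           (batch, features) activations
    gamma, beta: (features,) learned scale and shift
    """
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta            # learned scale and shift

# Example: activations with an arbitrary shift and scale.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8)) * 3.0 + 5.0
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# Each feature of y now has (approximately) zero mean and unit variance.
```

With `gamma = 1` and `beta = 0` this is plain standardization; learning `gamma` and `beta` lets the network recover the original activations if normalization hurts.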