List of papers
- [01VQA] VQA: Visual Question Answering
- [02EMD] Exploring Models and Data for Image Question Answering
- [03LAQ] Learning to Answer Questions From Image Using Convolutional Neural Network
- [04DCQ] Deep Compositional Question Answering with Neural Module Networks
- [05ABC] An attention based convolutional neural network for visual question answering
- [06ATM] Are you talking to a machine? Dataset and methods for multilingual image question answering
- [07DPP] Image question answering using convolutional neural network with dynamic parameter prediction
- [08WTL] Where to look: Focus regions for visual question answering
- [09AMA] Ask me anything: Free-form visual question answering based on knowledge from external sources
- [10V7W] Visual7W: Grounded Question Answering in Images
- [11AAA] Ask, Attend and Answer: Exploring question-guided spatial attention for visual question answering
- [12SAN] Stacked attention networks for image question answering
- [13BOW] Simple Baseline for Visual Question Answering
- [14ICV] Image Captioning & Visual Question Answering Based on Attributes & External Knowledge
- [15DMN] Dynamic Memory Networks for Visual and Textual Question Answering
- [16CMV] Compositional Memory for Visual Question Answering
- [17LCN] Learning to Compose Neural Networks for Question Answering
Results below are as self-published by the authors (not independently verified). They are for the 2015 test-dev split, except the final column, which is for test-standard.
- For a quick sense of where most of the models are headed architecturally, a word cloud of the papers (after removing stop-words) suffices
GoogLeNet vs VGGNet
- Some overlap exists because some papers experimented with both kinds of features
- A few papers, e.g. [06ATM], [16CMV], and [09AMA], also experimented with AlexNet, but in each case it was the underperforming model
- No one is using ResNet yet
LSTM vs GRU
- Only [15DMN] and [07DPP] are using GRU whereas others are using LSTM.
- However, it should be noted that [15DMN] uses a bi-directional GRU whereas [07DPP] uses a unidirectional GRU.
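For concreteness, here is a minimal numpy sketch of a GRU cell and a bi-directional encoder that concatenates the forward and backward final states. This is a simplified illustration, not any of these papers' exact formulations; the square weight shapes and the helper names (`gru_cell`, `bigru_encode`) are assumptions made for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step; all weights are (d, d) for simplicity (biases omitted)."""
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde           # interpolate old and new state

def encode(xs, params, reverse=False):
    """Run a GRU over a sequence; reverse=True gives the backward pass."""
    h = np.zeros(xs[0].shape)
    for x in (reversed(xs) if reverse else xs):
        h = gru_cell(x, h, *params)
    return h

def bigru_encode(xs, params):
    """Bi-directional encoding: concatenate forward and backward final states."""
    return np.concatenate([encode(xs, params), encode(xs, params, reverse=True)])

# Tiny usage example with random weights (d = 4, sequence length 3).
rng = np.random.default_rng(0)
params = tuple(0.1 * rng.standard_normal((4, 4)) for _ in range(6))
xs = [rng.standard_normal(4) for _ in range(3)]
h = bigru_encode(xs, params)   # shape (8,): forward state + backward state
```

The bi-directional variant doubles the encoding dimension but lets each state see both past and future tokens of the question.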
No bidirectional LSTM
- While several papers have reported [02EMD]'s bi-directional LSTM numbers (model 2-VIS+BLSTM), no one has used a bi-directional LSTM in their main model.
Attention models vs non-attention models
- The following papers use attention models
- [15DMN] - Attention gates using modified GRUs
- [11AAA] - Spatial attention with two hop model
- The following papers do not use any attention model, but state that adding attention would likely improve their model’s performance
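As a rough illustration of what one attention “hop” over image regions looks like: score each region against the question embedding, normalize the scores with a softmax, and take the weighted sum of region features. This is a generic sketch, not the exact scoring function of [11AAA], [12SAN], or [15DMN]; the bilinear scoring matrix `W` and the function name `spatial_attention` are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def spatial_attention(regions, question, W):
    """One attention hop over image regions, guided by the question.

    regions:  (num_regions, d) region feature vectors
    question: (d,) question embedding
    W:        (d, d) learned projection (hypothetical; papers differ here)
    Returns the attention weights and the attended image vector.
    """
    scores = regions @ W @ question   # relevance of each region to the question
    alpha = softmax(scores)           # normalize into a distribution over regions
    attended = alpha @ regions        # weighted sum of region features
    return alpha, attended

# Tiny usage example: 5 regions with 4-dimensional features.
rng = np.random.default_rng(0)
regions = rng.standard_normal((5, 4))
question = rng.standard_normal(4)
alpha, attended = spatial_attention(regions, question, np.eye(4))
```

A two-hop model such as [11AAA]'s repeats this step, feeding the attended vector back in to refine where the model looks.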
Episodic memory vs Semantic memory
- [16CMV] claims to have used “declarative memory”, which is a combination of semantic and episodic memory.
- [15DMN] uses episodic memory modules explicitly.
- For reference: Tulving, Endel. “Episodic and semantic memory 1.” Organization of Memory. London: Academic 381.4 (1972).
Activation functions used
- [15DMN] - Also uses ReLU and tanh
- [06ATM] - Also uses tanh
Use of Batch Normalization
- Only [07DPP] and [08WTL] have used batch normalization in training, even though it is used extensively in most other areas of deep learning research.
- For reference: Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015).
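The transform from the Ioffe & Szegedy paper is simple to state: normalize each feature by its batch mean and variance, then apply a learned scale and shift. A minimal numpy sketch (training-time statistics only; the running averages used at inference are omitted, and the function name is an assumption):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over the batch axis (Ioffe & Szegedy, 2015).

    x:           (batch, features) activations
    gamma, beta: (features,) learned scale and shift
    """
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta            # learned scale and shift

# Example: activations with an arbitrary shift and scale.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8)) * 3.0 + 5.0
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# Each feature of y now has (approximately) zero mean and unit variance.
```

With `gamma = 1` and `beta = 0` this is plain standardization; learning `gamma` and `beta` lets the network recover the original activations if normalization hurts.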