Visual Question Answering

Competition Website

Project website: https://code.google.com/p/word2vec/
Description: This tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research.
Related Papers:
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
Nice tutorial on Word2Vec: http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors
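
As a rough illustration of the two architectures named in the description above, the sketch below builds the training examples each one learns from: skip-gram predicts every context word from the center word, while continuous bag-of-words (CBOW) predicts the center word from its context. The function names, the fixed symmetric window, and the sample sentence are assumptions made for this example. It is not the word2vec tool's code, and it leaves out the techniques described in [2], such as negative sampling and subsampling of frequent words.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Return (context_words, center) examples: the context predicts the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:  # skip single-token sentences, which have no context
            examples.append((context, center))
    return examples


# Hypothetical sample question, tokenized by whitespace.
sentence = "what color is the cat".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Both functions enumerate the same windows; they differ only in which side of the example is the input and which is the prediction target.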

Project website: http://nlp.stanford.edu/projects/glove/
Description: GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
Paper: GloVe: Global Vectors for Word Representation http://nlp.stanford.edu/projects/glove/glove.pdf
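
To make "aggregated global word-word co-occurrence statistics" more concrete, here is a minimal sketch of how such counts might be gathered from a tokenized corpus. The function name, the toy corpus, the window size, and the 1/distance weighting of each pair are assumptions for illustration. The sketch covers only the counting step, not the GloVe training objective, and it is not the project's own code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cooccurrence_counts(
    corpus: List[List[str]], window: int = 2
) -> Dict[Tuple[str, str], float]:
    """Accumulate symmetric co-occurrence counts over the whole corpus.

    Each pair of words at most `window` positions apart contributes
    1/distance to both (a, b) and (b, a). The weighting is an assumption
    of this sketch.
    """
    counts: Dict[Tuple[str, str], float] = defaultdict(float)
    for tokens in corpus:
        for i, word in enumerate(tokens):
            # Look back only; adding both orderings keeps the counts symmetric.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)


# Hypothetical two-sentence corpus, tokenized by whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("sat", "on")])
```

The resulting dictionary is a sparse representation of the co-occurrence matrix; statistics of this kind are what the description above says training is performed on.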

Visual Question Answering (VQA) dataset: Based on images from the COCO dataset, it currently has 360K questions on 120K images. There are plans to release questions on the rest of the COCO images and on an additional 50K abstract images. All the questions are human-generated and were specifically designed to stump a “smart robot”.
Visual Madlibs: Contains fill-in-the-blank questions along with standard question-answer pairs. It has 360K questions on 10K images from the COCO dataset. Many questions require high-level human cognition, such as describing what one feels on seeing an image.
Toronto COCO-QA Dataset: Automatically generated questions from the captions of the MS COCO dataset. At 115K questions, it is smaller than the VQA dataset. Answers are all one word.
DAQUAR - DAtaset for QUestion Answering on Real-world images: A much smaller dataset, with about 12K questions. This was one of the earliest datasets for image question answering.

VQA: Visual Question Answering
http://arxiv.org/abs/1505.00468
Exploring Models and Data for Image Question Answering
http://arxiv.org/abs/1505.02074
Learning to Answer Questions From Image Using Convolutional Neural Network
http://arxiv.org/abs/1506.00333

Deep Compositional Question Answering with Neural Module Networks
http://arxiv.org/abs/1511.02799
An attention based convolutional neural network for visual question answering
http://arxiv.org/abs/1511.05960
Are you talking to a machine? dataset and methods for multilingual image question answering
http://arxiv.org/abs/1505.05612
Image question answering using convolutional neural network with dynamic parameter prediction
http://arxiv.org/abs/1511.05756
Where to look: Focus regions for visual question answering
http://arxiv.org/abs/1511.07394
Ask me anything: Free-form visual question answering based on knowledge from external sources
http://arxiv.org/abs/1511.06973
Exploring question-guided spatial attention for visual question answering
http://arxiv.org/abs/1511.05234
Stacked attention networks for image question answering
http://arxiv.org/abs/1511.02274
Simple Baseline for Visual Question Answering
http://arxiv.org/abs/1512.02167

Dynamic Memory Networks for Visual and Textual Question Answering
http://arxiv.org/abs/1603.01417