S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual Question Answering, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.279

URL : http://m-mitchell.com/papers/1505.00468v2.pdf

H. Ben-younes, R. Cadène, N. Thome, and M. Cord, MUTAN: Multimodal Tucker Fusion for Visual Question Answering, 2017.

B. Boutonnet and G. Lupyan, Words Jump-Start Vision: A Label Advantage in Object Recognition, Journal of Neuroscience, vol.35, issue.25, pp.9329-9335, 2015.
DOI : 10.1523/JNEUROSCI.5111-14.2015

URL : http://www.jneurosci.org/content/jneuro/35/25/9329.full.pdf

K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares et al., Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
DOI : 10.3115/v1/D14-1179

URL : https://hal.archives-ouvertes.fr/hal-01433235

A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav et al., Visual Dialog, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.121

H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle et al., GuessWhat?! Visual Object Discovery through Multi-modal Dialogue, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.475

URL : https://hal.archives-ouvertes.fr/hal-01549641

V. Dumoulin, J. Shlens, and M. Kudlur, A Learned Representation For Artistic Style, Proc. of ICLR, 2017.

F. Ferreira and M. Tanenhaus, Introduction to the special issue on language–vision interactions, Journal of Memory and Language, vol.57, issue.4, pp.455-459, 2007.
DOI : 10.1016/j.jml.2007.08.002

A. Fukui, D. Park, D. Yang, A. Rohrbach, T. Darrell et al., Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
DOI : 10.18653/v1/D16-1044

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.
DOI : 10.1162/neco.1997.9.8.1735

J. Lu, J. Yang, D. Batra, and D. Parikh, Hierarchical question-image co-attention for visual question answering, Proc. of NIPS, 2016.

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, Proc. of CVPR, 2016.

J. Kim, S. Lee, D. Kwak, M. Heo, J. Kim et al., Multimodal residual learning for visual QA, Proc. of NIPS, 2016.

J. Kim, K. On, J. Kim, J. Ha, and B. Zhang, Hadamard product for low-rank bilinear pooling, Proc. of ICLR, 2017.

P. Kok, M. Failing, and F. de Lange, Prior Expectations Evoke Stimulus Templates in the Primary Visual Cortex, Journal of Cognitive Neuroscience, vol.26, issue.7, pp.1546-1554, 2014.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., Microsoft COCO: Common Objects in Context, Proc. of ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_48

URL : http://arxiv.org/pdf/1405.0312.pdf

M. Malinowski, M. Rohrbach, and M. Fritz, Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.9

M. Malinowski, M. Rohrbach, and M. Fritz, Ask Your Neurons: A Deep Learning Approach to Visual Question Answering, International Journal of Computer Vision, 2016.

J. Pennington, R. Socher, and C. Manning, GloVe: Global Vectors for Word Representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
DOI : 10.3115/v1/D14-1162

M. Ren, R. Kiros, and R. Zemel, Exploring models and data for image question answering, Proc. of NIPS, 2015.

S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Proc. of ICML, 2015.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, Proc. of ICLR, 2015.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering, Proc. of CVPR, 2017.

G. Thierry, P. Athanasopoulos, A. Wiggett, B. Dering, and J. Kuipers, Unconscious effects of language-specific terminology on preattentive color perception, Proceedings of the National Academy of Sciences, vol.106, issue.11, pp.4567-4570, 2009.

L. van der Maaten and G. Hinton, Visualizing data using t-SNE, JMLR, vol.9, pp.2579-2605, 2008.

H. Xu and K. Saenko, Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, Proc. of ECCV, 2016.

URL : http://arxiv.org/pdf/1511.05234

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville et al., Show, attend and tell: Neural image caption generation with visual attention, Proc. of ICML, 2015.

Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, Stacked attention networks for image question answering, Proc. of CVPR, 2016.