P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson et al., Bottom-up and top-down attention for image captioning and VQA, VQA Workshop at CVPR, 2017.

J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Learning to Compose Neural Networks for Question Answering, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
DOI : 10.18653/v1/N16-1181

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual Question Answering, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.279
URL : http://m-mitchell.com/papers/1505.00468v2.pdf

A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, Translating embeddings for modeling multi-relational data, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00920777

J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, Deep Learning Workshop at NIPS, 2014.

H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin et al., Modulating early visual processing by language, NIPS, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01648683

V. Dumoulin, J. Shlens, and M. Kudlur, A learned representation for artistic style, ICLR, 2017.

D. Eigen, M. Ranzato, and I. Sutskever, Learning factored representations in a deep mixture of experts, ICLR Workshops, 2014.

J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, Convolutional sequence to sequence learning, ICML, 2017.

D. Geman, S. Geman, N. Hallonquist, and L. Younes, Visual Turing test for computer vision systems, Proceedings of the National Academy of Sciences, vol.112, pp.3618-3623, 2015.
DOI : 10.1073/pnas.1422953112
URL : http://www.pnas.org/content/112/12/3618.full.pdf

G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin, and J. Shlens, Exploring the structure of a real-time, arbitrary neural artistic stylization network, 2017.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.670

K. Guu, J. Miller, and P. Liang, Traversing Knowledge Graphs in Vector Space, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
DOI : 10.18653/v1/D15-1038
URL : http://aclweb.org/anthology/D/D15/D15-1038.pdf

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.90
URL : http://arxiv.org/pdf/1512.03385

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.
DOI : 10.1162/neco.1997.9.8.1735

J. Hu, L. Shen, and G. Sun, Squeeze-and-Excitation Networks, ILSVRC 2017 Workshop at CVPR, 2017.

X. Huang and S. Belongie, Arbitrary style transfer in real-time with adaptive instance normalization, ICCV, 2017.

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, ICML, 2015.

J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick et al., CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.215

J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei et al., Inferring and executing programs for visual reasoning, ICCV, 2017.

M. I. Jordan and R. A. Jacobs, Hierarchical Mixtures of Experts and the EM Algorithm, Neural Computation, vol.6, issue.2, pp.181-214, 1994.
DOI : 10.1162/neco.1994.6.2.181

T. Kim, I. Song, and Y. Bengio, Dynamic layer normalization for adaptive neural acoustic modeling in speech recognition, InterSpeech, 2017.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2015.

J. Lu, J. Yang, D. Batra, and D. Parikh, Hierarchical question-image co-attention for visual question answering, NIPS, 2016.

M. Malinowski and M. Fritz, A multi-world approach to question answering about real-world scenes based on uncertain input, NIPS, 2014.

M. Malinowski, M. Rohrbach, and M. Fritz, Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.9

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, NIPS, 2013.

J. Oh, S. Singh, H. Lee, and P. Kohli, Zero-shot task generalization with multi-task deep reinforcement learning, ICML, 2017.

E. Perez, H. de Vries, F. Strub, V. Dumoulin, and A. C. Courville, Learning visual reasoning without strong priors, MLSLP Workshop at ICML, 2017.

A. Radford, L. Metz, and S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, ICLR, 2016.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet Large Scale Visual Recognition Challenge, IJCV, vol.115, issue.3, pp.211-252, 2015.

A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu et al., A simple neural network module for relational reasoning, 2017.

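N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le et al., Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, ICLR, 2017.

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals et al., WaveNet: A generative model for raw audio, 2016.

A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves et al., Conditional image generation with PixelCNN decoders, NIPS, 2016.

L. van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research, vol.9, pp.2579-2605, 2008.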

Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola, Stacked Attention Networks for Image Question Answering, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.10