Acknowledgements

We would like to thank the developers of PyTorch. Our implementation was also based on the open-source code from [10]. We thank Mohammad Pezeshki, Nando de Freitas, Joelle Pineau, Olivier Pietquin, and Max Smith for helpful feedback and discussions, as well as Justin Johnson for CLEVR test set evaluations. We thank NVIDIA for donating a DGX-1 computer used in this work. We also acknowledge FRQNT through the CHIST-ERA IGLU project and CPER Nord.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Proc. of NIPS, 2012.
DOI : 10.1145/3065386

URL : http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.90

URL : http://arxiv.org/pdf/1512.03385

K. Cho, B. van Merriënboer, Ç. Gülçehre, F. Bougares, H. Schwenk et al., Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
DOI : 10.3115/v1/D14-1179

URL : https://hal.archives-ouvertes.fr/hal-01433235

I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, Proc. of NIPS, 2014.

M. Malinowski and M. Fritz, A multi-world approach to question answering about real-world scenes based on uncertain input, Proc. of NIPS, 2014.

D. Geman, S. Geman, N. Hallonquist, and L. Younes, Visual Turing test for computer vision systems, Proceedings of the National Academy of Sciences, vol.112, issue.12, pp.3618-3623, 2015.
DOI : 10.1073/pnas.1422953112

URL : http://www.pnas.org/content/112/12/3618.full.pdf

S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra et al., VQA: Visual Question Answering, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.279

URL : http://m-mitchell.com/papers/1505.00468v2.pdf

H. de Vries, F. Strub, S. Chandar, O. Pietquin, H. Larochelle et al., GuessWhat?! Visual Object Discovery through Multi-modal Dialogue, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.475

URL : https://hal.archives-ouvertes.fr/hal-01549641

J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick et al., CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.215

J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei et al., Inferring and executing programs for visual reasoning, 2017.

A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu et al., A simple neural network module for relational reasoning, 2017.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.670

R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, Learning to reason: End-to-end module networks for visual question answering, 2017.

V. Dumoulin, J. Shlens, and M. Kudlur, A learned representation for artistic style, Proc. of ICLR, 2017.

H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin et al., Modulating early visual processing by language, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01648683

G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin, and J. Shlens, Exploring the structure of a real-time, arbitrary neural artistic stylization network, arXiv:1705.06830, 2017.

T. Kim, I. Song, and Y. Bengio, Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition, Interspeech 2017, 2017.
DOI : 10.21437/Interspeech.2017-556

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proc. of ICML, 2015.

J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, Deep Learning workshop at NIPS, 2014.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision, vol.115, issue.3, pp.211-252, 2015.
DOI : 10.1007/s11263-015-0816-y

URL : http://arxiv.org/pdf/1409.0575

N. Watters, A. Tacchetti, T. Weber, R. Pascanu, P. Battaglia et al., Visual interaction networks, CoRR, 2017.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, Proc. of ICLR, 2015.

Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola, Stacked Attention Networks for Image Question Answering, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.10

L. van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research, vol.9, pp.2579-2605, 2008.

J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Learning to Compose Neural Networks for Question Answering, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
DOI : 10.18653/v1/N16-1181

A. Radford, L. Metz, and S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, 2015.

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.
DOI : 10.1162/neco.1997.9.8.1735

M. I. Jordan and R. A. Jacobs, Hierarchical Mixtures of Experts and the EM Algorithm, Neural Computation, vol.6, issue.2, pp.181-214, 1994.
DOI : 10.1162/neco.1994.6.2.181

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals et al., WaveNet: A generative model for raw audio, 2016.

A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, and A. Graves, Conditional image generation with PixelCNN decoders, Proc. of NIPS, 2016.

S. E. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang et al., Parallel multiscale autoregressive density estimation, 2017.

S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick et al., Generating interpretable images with controllable structure, Proc. of ICLR, 2017.