Acknowledgements

We would like to thank the developers of PyTorch. Our implementation was also based on the open-source code from [10]. We thank Mohammad Pezeshki and Max Smith for helpful feedback and discussions, as well as Justin Johnson for CLEVR test set evaluations. We thank NVIDIA for donating a DGX-1 computer used in this work. We also acknowledge FRQNT through the CHIST-ERA IGLU project and CPER Nord.
ImageNet classification with deep convolutional neural networks, Proc. of NIPS, 2012.
DOI : 10.1145/3065386
URL : http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf
Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.90
URL : http://arxiv.org/pdf/1512.03385
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
DOI : 10.3115/v1/D14-1179
URL : https://hal.archives-ouvertes.fr/hal-01433235
Sequence to sequence learning with neural networks, Proc. of NIPS, 2014.
A multi-world approach to question answering about real-world scenes based on uncertain input, Proc. of NIPS, 2014.
Visual Turing test for computer vision systems, Proceedings of the National Academy of Sciences, vol.112, issue.12, pp.3618-3623, 2015.
DOI : 10.1073/pnas.1422953112
URL : http://www.pnas.org/content/112/12/3618.full.pdf
VQA: Visual Question Answering, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.279
URL : http://m-mitchell.com/papers/1505.00468v2.pdf
GuessWhat?! Visual Object Discovery through Multi-modal Dialogue, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.475
URL : https://hal.archives-ouvertes.fr/hal-01549641
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.215
Inferring and executing programs for visual reasoning, 2017.
A simple neural network module for relational reasoning, 2017.
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.670
Learning to reason: End-to-end module networks for visual question answering.
A learned representation for artistic style, Proc. of ICLR, 2017.
Modulating early visual processing by language, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01648683
Exploring the structure of a real-time, arbitrary neural artistic stylization network, 2017.
Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition, Interspeech 2017, 2017.
DOI : 10.21437/Interspeech.2017-556
Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proc. of ICML, 2015.
Empirical evaluation of gated recurrent neural networks on sequence modeling, Deep Learning workshop at NIPS, 2014.
ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision, vol.115, issue.3, pp.211-252, 2015.
DOI : 10.1007/s11263-015-0816-y
URL : http://arxiv.org/pdf/1409.0575
Visual interaction networks, CoRR, 2017.
Adam: A method for stochastic optimization, Proc. of ICLR, 2015.
Stacked Attention Networks for Image Question Answering, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.10
Visualizing data using t-SNE, pp.2579-2605, 2008.
Learning to Compose Neural Networks for Question Answering, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
DOI : 10.18653/v1/N16-1181
Unsupervised representation learning with deep convolutional generative adversarial networks, 2015.
Long Short-Term Memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.
DOI : 10.1162/neco.1997.9.8.1735
Hierarchical Mixtures of Experts and the EM Algorithm, Neural Computation, vol.6, issue.2, pp.181-214, 1994.
DOI : 10.1162/neco.1994.6.2.181
WaveNet: A generative model for raw audio, 2016.
Conditional image generation with PixelCNN decoders, Proc. of NIPS, 2016.
Parallel multiscale autoregressive density estimation, 2017.
Generating interpretable images with controllable structure, Proc. of ICLR, 2017.