P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson et al., Bottom-up and top-down attention for image captioning and visual question answering. arXiv preprint, 2017.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, ICLR, 2015.

S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, Scheduled sampling for sequence prediction with recurrent neural networks, NIPS, 2015.

S. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz et al., Generating Sentences from a Continuous Space, Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2016.
DOI : 10.18653/v1/K16-1002

URL : http://arxiv.org/pdf/1511.06349

A. Canziani, A. Paszke, and E. Culurciello, An analysis of deep neural network models for practical applications. arXiv preprint, 2016.

K. Chang, A. Krishnamurthy, A. Agarwal, H. Daumé III et al., Learning to search better than your teacher, ICML, 2015.

C. Chiu, T. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen et al., State-of-the-art speech recognition with sequence-to-sequence models. arXiv preprint, 2017.

K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
DOI : 10.3115/v1/D14-1179

URL : https://hal.archives-ouvertes.fr/hal-01433235

J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, Attention-based models for speech recognition, NIPS, 2015.

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, NIPS Deep Learning Workshop, 2014.

H. Daumé III, J. Langford, and D. Marcu, Search-based structured prediction, Machine Learning, pp.297-325, 2009.
DOI : 10.1007/s10994-009-5106-x

M. Denkowski and A. Lavie, Meteor Universal: Language Specific Translation Evaluation for Any Target Language, Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014.
DOI : 10.3115/v1/W14-3348

URL : http://www.aclweb.org/anthology/W/W14/W14-3348.pdf

M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, vol.88, issue.2, pp.303-338, 2010.
DOI : 10.1007/s11263-009-0275-4

URL : http://eprints.pascal-network.org/archive/00006961/01/everingham10.pdf

M. Fadaee, A. Bisazza, and C. Monz, Data Augmentation for Low-Resource Neural Machine Translation, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017.
DOI : 10.18653/v1/P17-2090

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.90

URL : http://arxiv.org/pdf/1512.03385

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.
DOI : 10.1162/neco.1997.9.8.1735

S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, ICML, 2015.

M. Iyyer, V. Manjunatha, J. Boyd-Graber, and H. Daumé III, Deep Unordered Composition Rivals Syntactic Methods for Text Classification, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015.
DOI : 10.3115/v1/P15-1162

A. Karpathy and F. Li, Deep visual-semantic alignments for generating image descriptions, CVPR, 2015.
DOI : 10.1109/TPAMI.2016.2598339

URL : http://arxiv.org/pdf/1412.2306

D. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2015.

R. Kiros, R. Salakhutdinov, and R. Zemel, Multimodal neural language models, ICML, 2014.

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata et al., Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, International Journal of Computer Vision, vol.123, issue.1, pp.32-73, 2017.
DOI : 10.1007/s11263-016-0981-7

URL : https://link.springer.com/content/pdf/10.1007%2Fs11263-016-0981-7.pdf

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012.
DOI : 10.1145/3065386

URL : http://dl.acm.org/ft_gateway.cfm?id=3065386&type=pdf

D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas et al., Zoneout: Regularizing RNNs by randomly preserving hidden activations, ICLR, 2017.

A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury et al., Ask me anything: Dynamic memory networks for natural language processing, ICML, 2016.

R. Leblond, J. Alayrac, A. Osokin, and S. Lacoste-Julien, SEARNN: Training RNNs with global-local losses, ICLR, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01665263

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, pp.2278-2324, 1998.
DOI : 10.1109/5.726791

URL : http://www.cs.berkeley.edu/~daf/appsem/Handwriting/papers/00726791.pdf

C. Lin, ROUGE: A package for automatic evaluation of summaries, ACL Workshop on Text Summarization Branches Out, 2004.

J. Lu, C. Xiong, D. Parikh, and R. Socher, Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.345

T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, ICLR, 2013.

M. Norouzi, S. Bengio, Z. Chen, N. Jaitly, M. Schuster et al., Reward augmented maximum likelihood for neural structured prediction, NIPS, 2016.

K. Papineni, S. Roukos, T. Ward, and W. Zhu, BLEU: a method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL '02, 2002.
DOI : 10.3115/1073083.1073135

M. Paulin, J. Revaud, Z. Harchaoui, F. Perronnin, and C. Schmid, Transformation Pursuit for Image Classification, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.466

URL : https://hal.archives-ouvertes.fr/hal-00979464

M. Pedersoli, T. Lucas, C. Schmid, and J. Verbeek, Areas of Attention for Image Captioning, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.140

URL : https://hal.archives-ouvertes.fr/hal-01428963

J. Pennington, R. Socher, and C. Manning, GloVe: Global Vectors for Word Representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
DOI : 10.3115/v1/D14-1162

V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, Dropout Improves Recurrent Neural Networks for Handwriting Recognition, 2014 14th International Conference on Frontiers in Handwriting Recognition, 2014.
DOI : 10.1109/ICFHR.2014.55

URL : http://arxiv.org/pdf/1312.4569

M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, Sequence level training with recurrent neural networks, ICLR, 2016.

S. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, Self-Critical Sequence Training for Image Captioning, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.131

URL : http://arxiv.org/pdf/1612.00563

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, vol.15, pp.1929-1958, 2014.

I. Sutskever, O. Vinyals, and Q. Le, Sequence to sequence learning with neural networks, NIPS, 2014.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the Inception Architecture for Computer Vision, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.308

R. Vedantam, C. Zitnick, and D. Parikh, CIDEr: Consensus-based image description evaluation, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7299087

URL : http://arxiv.org/pdf/1411.5726.pdf

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298935

URL : http://arxiv.org/pdf/1411.4555

Z. Xie, S. Wang, J. Li, D. Lévy, A. Nie et al., Data noising as smoothing in neural network language models, ICLR, 2017.

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville et al., Show, attend and tell: Neural image caption generation with visual attention, ICML, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01466414

Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. Cohen, Encode, review, and decode: Reviewer module for caption generation, NIPS, 2016.

T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, Boosting Image Captioning with Attributes, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.524

Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, Image Captioning with Semantic Attention, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.503

URL : http://arxiv.org/pdf/1603.03925