D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, ICLR, 2015.

S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, Scheduled sampling for sequence prediction with recurrent neural networks, NIPS, 2015.

H. Bilen and A. Vedaldi, Weakly Supervised Deep Detection Networks, CVPR, 2016.
DOI : 10.1109/CVPR.2016.311

URL : http://arxiv.org/abs/1511.02853

X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta et al., Microsoft COCO captions: Data collection and evaluation server, 2015.

J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, Attention-based models for speech recognition, NIPS, 2015.

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, NIPS Deep Learning Workshop, 2014.

R. Cinbis, J. Verbeek, and C. Schmid, Multi-fold MIL Training for Weakly Supervised Object Localization, CVPR, 2014.
DOI : 10.1109/CVPR.2014.309

URL : https://hal.archives-ouvertes.fr/hal-00975746

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A large-scale hierarchical image database, CVPR, 2009.

M. Denkowski and A. Lavie, Meteor Universal: Language Specific Translation Evaluation for Any Target Language, Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014.
DOI : 10.3115/v1/W14-3348

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.675.6117

J. Donahue, L. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama et al., Long-term recurrent convolutional networks for visual recognition and description, CVPR, 2015.
DOI : 10.1109/cvpr.2015.7298878

URL : http://arxiv.org/abs/1411.4389

H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng et al., From captions to visual concepts and back, CVPR, 2015.
DOI : 10.1109/CVPR.2015.7298754

URL : http://arxiv.org/abs/1411.4952

K. Gregor, I. Danihelka, A. Graves, and D. Wierstra, DRAW: A recurrent neural network for image generation, ICML, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, ECCV, 2014.

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.
DOI : 10.1162/neco.1997.9.8.1735

J. Jin, K. Fu, R. Cui, F. Sha, and C. Zhang, Aligning where to see and what to tell: image caption with region-based attention and scene factorization, 2015.

J. Johnson, A. Karpathy, and L. Fei-Fei, DenseCap: Fully convolutional localization networks for dense captioning, CVPR, 2016.

A. Karpathy and L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, CVPR, 2015.
DOI : 10.1109/cvpr.2015.7298932

URL : http://arxiv.org/abs/1412.2306

D. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2015.

R. Kiros, R. Salakhutdinov, and R. Zemel, Multimodal neural language models, ICML, 2014.

D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas et al., Zoneout: Regularizing RNNs by randomly preserving hidden activations, 2016.

T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick et al., Microsoft COCO: Common Objects in Context, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_48

URL : http://arxiv.org/abs/1405.0312

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed et al., SSD: Single Shot MultiBox Detector, ECCV, 2016.
DOI : 10.1007/978-3-319-46448-0_2

URL : http://arxiv.org/abs/1512.02325

J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang et al., Deep captioning with multimodal recurrent neural networks (m-RNN), ICLR, 2015.

T. Mikolov, W. Yih, and G. Zweig, Linguistic regularities in continuous space word representations, NAACL-HLT, 2013.

K. Papineni, S. Roukos, T. Ward, and W. Zhu, BLEU: a method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002.
DOI : 10.3115/1073083.1073135

M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, Sequence level training with recurrent neural networks, ICLR, 2016.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS, 2015.
DOI : 10.1109/TPAMI.2016.2577031

URL : http://arxiv.org/abs/1506.01497

A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, Grounding of Textual Phrases in Images by Reconstruction, ECCV, 2016.
DOI : 10.1007/978-3-319-46448-0_49

O. Russakovsky, Y. Lin, K. Yu, and L. Fei-Fei, Object-centric spatial pooling for image classification, ECCV, 2012.
DOI : 10.1007/978-3-642-33709-3_1

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, ICLR, 2015.

I. Sutskever, O. Vinyals, and Q. Le, Sequence to sequence learning with neural networks, NIPS, 2014.

J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, Selective Search for Object Recognition, International Journal of Computer Vision, vol.104, issue.2, pp.154-171, 2013.
DOI : 10.1007/s11263-013-0620-5

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.361.3382

R. Vedantam, C. Zitnick, and D. Parikh, CIDEr: Consensus-based image description evaluation, CVPR, 2015.
DOI : 10.1109/CVPR.2015.7299087

URL : http://arxiv.org/abs/1411.5726

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, CVPR, 2015.
DOI : 10.1109/CVPR.2015.7298935

URL : http://arxiv.org/abs/1411.4555

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville et al., Show, attend and tell: Neural image caption generation with visual attention, ICML, 2015.

Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. Cohen, Encode, review, and decode: Reviewer module for caption generation, NIPS, 2016.

L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal et al., Describing Videos by Exploiting Temporal Structure, ICCV, 2015.
DOI : 10.1109/ICCV.2015.512

URL : http://arxiv.org/abs/1502.08029

S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori et al., Every moment counts: Dense detailed labeling of actions in complex videos, 2015.

Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, Image Captioning with Semantic Attention, CVPR, 2016.
DOI : 10.1109/CVPR.2016.503

URL : http://arxiv.org/abs/1603.03925

C. Zitnick and P. Dollár, Edge Boxes: Locating Object Proposals from Edges, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_26

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.453.5208