D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, ICLR, 2015.

S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, Scheduled sampling for sequence prediction with recurrent neural networks, 2015.

H. Bilen and A. Vedaldi, Weakly Supervised Deep Detection Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.311

URL : http://arxiv.org/abs/1511.02853

J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, Attention-based models for speech recognition, NIPS, 2015.

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, NIPS Deep Learning Workshop, 2014.

R. Cinbis, J. Verbeek, and C. Schmid, Multi-fold MIL Training for Weakly Supervised Object Localization, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.309

URL : https://hal.archives-ouvertes.fr/hal-00975746

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A large-scale hierarchical image database, CVPR, 2009.

J. Donahue, L. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama et al., Long-term recurrent convolutional networks for visual recognition and description, CVPR, 2015.
DOI : 10.1109/tpami.2016.2599174

URL : http://arxiv.org/abs/1411.4389

H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng et al., From captions to visual concepts and back, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298754

URL : http://arxiv.org/abs/1411.4952

K. Gregor, I. Danihelka, A. Graves, and D. Wierstra, DRAW: A recurrent neural network for image generation, ICML, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, ECCV, 2014.
DOI : 10.1007/978-3-319-10578-9_23

URL : http://arxiv.org/abs/1406.4729

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.
DOI : 10.1162/neco.1997.9.8.1735

J. Jin, K. Fu, R. Cui, F. Sha, and C. Zhang, Aligning where to see and what to tell: image caption with region-based attention and scene factorization, 2015.

J. Johnson, A. Karpathy, and L. Fei-Fei, DenseCap: Fully Convolutional Localization Networks for Dense Captioning, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.494

URL : http://arxiv.org/pdf/1511.07571

A. Karpathy and L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, CVPR, 2015.
DOI : 10.1109/tpami.2016.2598339

URL : http://arxiv.org/abs/1412.2306

D. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2015.

R. Kiros, R. Salakhutdinov, and R. Zemel, Multimodal neural language models, ICML, 2014.

T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick et al., Microsoft COCO: Common objects in context, ECCV, 2014.

C. Liu, J. Mao, F. Sha, and A. Yuille, Attention correctness in neural image captioning, AAAI, 2017.

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed et al., SSD: Single shot multibox detector, ECCV, 2016.

J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang et al., Deep captioning with multimodal recurrent neural networks (m-RNN), ICLR, 2015.

B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier et al., Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, ICCV, 2015.
DOI : 10.1007/s11263-016-0965-7

URL : http://arxiv.org/abs/1505.04870

M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, Sequence level training with recurrent neural networks, 2016.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS, 2015.
DOI : 10.1109/TPAMI.2016.2577031

URL : http://arxiv.org/abs/1506.01497

A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, Grounding of Textual Phrases in Images by Reconstruction, ECCV, 2016.

O. Russakovsky, Y. Lin, K. Yu, and L. Fei-Fei, Object-centric spatial pooling for image classification, ECCV, 2012.
DOI : 10.1007/978-3-642-33709-3_1

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2015.

I. Sutskever, O. Vinyals, and Q. Le, Sequence to sequence learning with neural networks, 2014.

J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, Selective Search for Object Recognition, International Journal of Computer Vision, vol.104, issue.2, pp.154-171, 2013.
DOI : 10.1007/s11263-013-0620-5

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.361.3382

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298935

URL : http://arxiv.org/abs/1411.4555

Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel, What Value Do Explicit High Level Concepts Have in Vision to Language Problems?, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.29

URL : http://arxiv.org/abs/1506.01144

K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville et al., Show, attend and tell: Neural image caption generation with visual attention, ICML, 2015.

Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. Cohen, Encode, review, and decode: Reviewer module for caption generation, NIPS, 2016.

L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal et al., Describing Videos by Exploiting Temporal Structure, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.512

URL : http://arxiv.org/abs/1502.08029

S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori et al., Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos, International Journal of Computer Vision.

Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, Image Captioning with Semantic Attention, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.503

URL : http://arxiv.org/pdf/1603.03925

C. Zitnick and P. Dollár, Edge Boxes: Locating Object Proposals from Edges, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_26

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.453.5208

K. Papineni, S. Roukos, T. Ward, and W. Zhu, BLEU: a method for automatic evaluation of machine translation, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, 2002.
DOI : 10.3115/1073083.1073135

R. Vedantam, C. Zitnick, and D. Parikh, CIDEr: Consensus-based image description evaluation, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7299087

URL : http://arxiv.org/abs/1411.5726

Generated caption:
-A surfer is riding a big wave in the ocean

Generated caption:
-A man is doing a trick on a skateboard
Ground-truth captions:
-a man flying through the air while riding a skateboard.
-A skateboarder is high in the air separated from his board in the break of a ramp

Generated caption:
-A man standing in a living room playing a video game
Ground-truth captions:
-A man standing in a living room holding a Nintendo Wii game controller.
-Man standing in a family room with a game controller in his hand.
-A man is standing and playing video games.
-A man is standing in a living room with a Wii controller.
-A middle aged man is playing with a WII game

Generated caption:
-A baseball player swinging a bat at a ball
Ground-truth captions:
-A batter holds the bat behind his head for a powerful swing.
-A man holding a bat at a professional baseball game.
-a baseball player in the batters box at a game
-A man is at bat during a professional baseball game.
-There are spectators watching a baseball game going on

Generated caption:
-A vase filled with flowers on a table
Ground-truth captions:
-A counter containing two silver vases with colorful flowers.
-A base with yellow pink and orange daisies in it.
-There are some flowers in decorative silver vases
-Colorful flowers in a metal vase sitting on a mirror ledge.
-Flowers in a paper that is silver and standing.

Generated caption:
-A bathroom with a toilet sink and bathtub
Ground-truth captions:
-A tiled bathroom containing a vanity sink, toilet and bathtub.
-a bath room with a toilet a bath tub and a sink
-There is a bathroom with a toilet and tub.
-A white toilet and bath in a room.
-A white bathtub sitting next to a toilet

Generated caption:
-A man in a suit and tie wearing a hat
Ground-truth captions:
-A man wearing a hat and a gray jacket.
-A man with a suit and tie and a fedora on.
-A man wears a suit with a blue shirt and a multicolored tie.
-A man in a suit and tie wearing a hat.
-a man that is in a suite and a tie

Generated caption:
-A cat is standing next to a bicycle
Ground-truth captions:
-A cat standing next to a bike parked against a wall.
-A ca walking with its tail straight up.
-A kitten is walking next to a parked bike inside.
-A small cat is walking behind a bike.
-A baby tabby cat walking behind a bicycle leaning against a wall

Generated caption:
-A large airplane flying through a blue sky
Ground-truth captions:
-A picture of a plane is flying in the air.
-An airplane with two propellor engines flying in the sky.
-A propeller plane flying through a blue sky.
-A small airplane is flying through a clear blue sky.
-A propeller plane flying through a blue sky.

Generated caption:
-A couple of kids laying on a bed
Ground-truth captions:
-A custom cake featuring a fisherman for a man's 65th birthday.
-A birthday cake made to look like a man on a pier fishing.
-a fathers birthday cake with a pond and sheep on the hill
-The birthday cake is in the shape of a hill with a fisherman sitting on it.
-The birthday cake for a 65 year old.

Generated caption:
-A large body of water with boats in the background
Ground-truth captions:
-Two boats that are sitting in the water.
-Those boats are waiting by the pier in the water.
-A large city is on the water with boats.
-City skyline as seen beyond waterway docking area.
-A large body of water filled with boats next to a tall building.

Generated caption:
-A man holding a piece of cake on a plate
Ground-truth captions:
-A man is sitting down with a piece of chocolate cake in front of him with a fork in his hand.
-a person at a table with a large piece of cake
-This man is holding a fork to eat a piece of chocolate cake.
-The man sitting at a table with a large slice of cake.
-a man sitting at a table with a piece of cake and holding a fork

Generated caption:
-A man standing in a kitchen preparing food
Ground-truth captions:
-A cook in a restaurant kitchen putting chopped vegetables in a bow.
-At a restaurant's kitchen, a gentleman wearing sanitary gloves prepares a salad.
-A man in a blue shirt is putting food into a bowl.
-The man is in the kitchen preparing a meal.
-A man in blue shirt preparing food in a kitchen

Generated caption:
-A herd of sheep grazing on a lush green field
Ground-truth captions:
-A herd of animals traveling down a country road surrounded by a lush green landscape
-Jogger running past a flock of sheep in a rural area
-A man walking beside sheep on a country road.
-Jogger running past a flock of white sheep.
-A runner wearing spandex has come across a large herd traveling down the dirt road.

Generated caption:
-A woman is eating a piece of pizza
Ground-truth captions:
-A woman wearing a hat bites into a pastry.
-A girl eating a donut out of a bag
-a woman eating a powdered sugar pastry out of a bag
-A woman eating a pastry at a coffee shop.
-The woman in the hat is taking a bite of a pastry