J. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev et al., Unsupervised Learning from Narrated Instruction Videos, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.495

URL : https://hal.archives-ouvertes.fr/hal-01171193

S. Bird, E. Klein, and E. Loper, Natural language processing with Python, 2009.

O. Boiman and M. Irani, Detecting irregularities in images and in video, IJCV, 2007.
DOI : 10.1109/iccv.2005.70

P. Bojanowski, R. Lajugie, E. Grave, F. Bach, I. Laptev et al., Weakly-Supervised Alignment of Video with Text, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.507

URL : https://hal.archives-ouvertes.fr/hal-01154523

D. L. Chen and W. B. Dolan, Collecting highly parallel data for paraphrase evaluation, ACL, 2011.

X. Chen, H. Fang, T. Y. Lin, R. Vedantam, S. Gupta, P. Dollár et al., Microsoft COCO captions: Data collection and evaluation server. arXiv preprint, 2015.

P. Das, C. Xu, R. F. Doell, and J. J. Corso, A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.340

A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean et al., DeViSE: A deep visual-semantic embedding model, NIPS, 2013.

M. Gygli, H. Grabner, and L. Van Gool, Video summarization by learning submodular mixtures of objectives, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298928

M. Gygli, Y. Song, and L. Cao, Video2GIF: Automatic Generation of Animated GIFs from Video, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.114

URL : http://arxiv.org/pdf/1605.04850

L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell et al., Localizing Moments in Video with Natural Language, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/iccv.2017.618

S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol.9, issue.8, 1997.
DOI : 10.1162/neco.1997.9.8.1735

R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko, Modeling Relationships in Referential Expressions with Compositional Modular Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.470

URL : http://arxiv.org/pdf/1611.09978

R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko et al., Natural Language Object Retrieval, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.493

URL : http://arxiv.org/pdf/1511.04164

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long et al., Caffe: Convolutional Architecture for Fast Feature Embedding, Proceedings of the ACM International Conference on Multimedia, MM '14, 2014.
DOI : 10.1145/2647868.2654889

A. Karpathy, A. Joulin, and L. Fei-Fei, Deep fragment embeddings for bidirectional image sentence mapping, NIPS, 2014.

S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg, ReferItGame: Referring to Objects in Photographs of Natural Scenes, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
DOI : 10.3115/v1/D14-1086

B. Klein, G. Lev, G. Sadeh, and L. Wolf, Associating neural word embeddings with deep image representations using Fisher Vectors, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7299073

D. Lin, S. Fidler, C. Kong, and R. Urtasun, Visual Semantic Search: Retrieving Videos via Complex Textual Queries, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.340

URL : http://www.cs.utoronto.ca/%7Efidler/papers/lin_et_al_cvpr14.pdf

W. Liu, T. Mei, Y. Zhang, C. Che, and J. Luo, Multi-task deep visual-semantic embedding for video thumbnail selection, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298994

J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille et al., Generation and Comprehension of Unambiguous Object Descriptions, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.9

URL : http://arxiv.org/pdf/1511.02283

J. Mas and G. Fernandez, Video shot boundary detection based on color histogram, Notebook Papers, TRECVID 2003.

M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, Learning Joint Representations of Videos and Sentences with Web Image Search, ECCV Workshops, 2016.

Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, Jointly Modeling Embedding and Translation to Bridge Video and Language, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.497

URL : http://arxiv.org/pdf/1505.01861

J. Pennington, R. Socher, and C. D. Manning, GloVe: Global Vectors for Word Representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
DOI : 10.3115/v1/D14-1162

B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier et al., Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, ICCV, 2015.
DOI : 10.1007/s11263-016-0965-7

URL : http://arxiv.org/pdf/1505.04870

M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele et al., Grounding action descriptions in videos, TACL, 2013.

A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, Grounding of Textual Phrases in Images by Reconstruction, ECCV, 2016.

A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal et al., Coherent Multi-sentence Video Description with Variable Level of Detail, 2014.
DOI : 10.1007/978-3-319-11752-2_15

A. Rohrbach, M. Rohrbach, and B. Schiele, The Long-Short Story of Movie Description, 2015.

A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, A Dataset for Movie Description, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298940

URL : http://arxiv.org/pdf/1501.02530

M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal et al., Translating Video Content to Natural Language Descriptions, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.61

URL : http://ivan-titov.org/papers/iccv13.pdf

O. Sener, A. R. Zamir, S. Savarese, and A. Saxena, Unsupervised Semantic Parsing of Video Collections, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.509

URL : http://arxiv.org/pdf/1506.08438

A. Sharghi, B. Gong, and M. Shah, Query-Focused Extractive Video Summarization, ECCV, 2016.

URL : http://arxiv.org/pdf/1607.05177

G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev et al., Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, ECCV, 2016.

URL : https://hal.archives-ouvertes.fr/hal-01418216

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2015.

R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, Grounded compositional semantics for finding and describing images with sentences, 2014.

Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, TVSum: Summarizing web videos using titles, CVPR, 2015.

M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun et al., MovieQA: Understanding Stories in Movies through Question-Answering, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.501

URL : http://arxiv.org/pdf/1512.02902

S. Tellex and D. Roy, Towards surveillance video search by natural language query, Proceeding of the ACM International Conference on Image and Video Retrieval, CIVR '09, 2009.
DOI : 10.1145/1646396.1646442

URL : http://www.media.mit.edu/cogmac/publications/stefie10-civr2009.pdf

B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni et al., The new data and new challenges in multimedia research, 2015.

A. Torabi, C. Pal, H. Larochelle, and A. Courville, Using descriptive video services to create a large data source for video annotation research. arXiv preprint, 2015.

A. Torabi, N. Tandon, and L. Sigal, Learning language-visual embedding for movie understanding with natural language, 2016.

S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell et al., Sequence to Sequence -- Video to Text, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.515

URL : http://arxiv.org/pdf/1505.00487

S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney et al., Translating Videos to Natural Language Using Deep Recurrent Neural Networks, Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015.
DOI : 10.3115/v1/N15-1173

URL : http://arxiv.org/pdf/1412.4729

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, ECCV, 2016.

URL : http://arxiv.org/pdf/1608.00859

J. Xu, T. Mei, T. Yao, and Y. Rui, MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.571

R. Xu, C. Xiong, W. Chen, and J. J. Corso, Jointly modeling deep video and compositional text to bridge vision and language in a unified framework, AAAI, 2015.

H. Yang, B. Wang, S. Lin, D. Wipf, M. Guo et al., Unsupervised Extraction of Video Highlights via Robust Recurrent Auto-Encoders, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.526

URL : http://arxiv.org/pdf/1510.01442

T. Yao, T. Mei, and Y. Rui, Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.112

S. Yeung, A. Fathi, and L. Fei-Fei, VideoSET: Video summary evaluation through text, CVPR Workshops, 2014.

H. Yu and J. M. Siskind, Grounded language learning from video described with sentences, ACL, 2013.

H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.496

URL : http://arxiv.org/pdf/1510.07712