Unsupervised Learning from Narrated Instruction Videos, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.495
URL : https://hal.archives-ouvertes.fr/hal-01171193
Natural language processing with Python, 2009. ,
Detecting irregularities in images and in video. IJCV, 2007. ,
DOI : 10.1109/iccv.2005.70
Weakly-Supervised Alignment of Video with Text, 2015 IEEE International Conference on Computer Vision (ICCV), 2015. ,
DOI : 10.1109/ICCV.2015.507
URL : https://hal.archives-ouvertes.fr/hal-01154523
Collecting highly parallel data for paraphrase evaluation, ACL, 2011. ,
Microsoft COCO captions: Data collection and evaluation server. arXiv preprint, 2015. ,
A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013. ,
DOI : 10.1109/CVPR.2013.340
Devise: A deep visual-semantic embedding model, NIPS, 2013. ,
Video summarization by learning submodular mixtures of objectives, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ,
DOI : 10.1109/CVPR.2015.7298928
Video2GIF: Automatic Generation of Animated GIFs from Video, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.114
URL : http://arxiv.org/pdf/1605.04850
Localizing moments in video with natural language. arXiv preprint, 2017. ,
DOI : 10.1109/iccv.2017.618
Long Short-Term Memory, Neural Computation, vol.4, issue.8, 1997. ,
DOI : 10.1016/0893-6080(88)90007-X
Modeling Relationships in Referential Expressions with Compositional Modular Networks, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. ,
DOI : 10.1109/CVPR.2017.470
URL : http://arxiv.org/pdf/1611.09978
Natural Language Object Retrieval, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.493
URL : http://arxiv.org/pdf/1511.04164
Caffe, Proceedings of the ACM International Conference on Multimedia, MM '14, 2014. ,
DOI : 10.1145/2647868.2654889
Deep fragment embeddings for bidirectional image sentence mapping, NIPS, 2014. ,
ReferItGame: Referring to Objects in Photographs of Natural Scenes, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014. ,
DOI : 10.3115/v1/D14-1086
Associating neural word embeddings with deep image representations using Fisher Vectors, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ,
DOI : 10.1109/CVPR.2015.7299073
Visual Semantic Search: Retrieving Videos via Complex Textual Queries, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014. ,
DOI : 10.1109/CVPR.2014.340
URL : http://www.cs.utoronto.ca/%7Efidler/papers/lin_et_al_cvpr14.pdf
Multi-task deep visual-semantic embedding for video thumbnail selection, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ,
DOI : 10.1109/CVPR.2015.7298994
Generation and Comprehension of Unambiguous Object Descriptions, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.9
URL : http://arxiv.org/pdf/1511.02283
Video shot boundary detection based on color histogram. Notebook Papers TRECVID2003, 2003. ,
Learning Joint Representations of Videos and Sentences with Web Image Search, ECCV Workshops, 2016. ,
DOI : 10.1109/ICCV.2015.11
Jointly Modeling Embedding and Translation to Bridge Video and Language, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.497
URL : http://arxiv.org/pdf/1505.01861
Glove: Global Vectors for Word Representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014. ,
DOI : 10.3115/v1/D14-1162
Flickr30k entities: Collecting region-to-phrase correspondences for richer imageto-sentence models, ICCV, 2015. ,
DOI : 10.1007/s11263-016-0965-7
URL : http://arxiv.org/pdf/1505.04870
Grounding action descriptions in videos, TACL, 2013. ,
Grounding of Textual Phrases in Images by Reconstruction, p.2016 ,
DOI : 10.1007/978-3-319-10602-1_26
Coherent Multi-sentence Video Description with Variable Level of Detail, 2014. ,
DOI : 10.1007/978-3-319-11752-2_15
The Long-Short Story of Movie Description, 2015. ,
DOI : 10.1109/ICCV.2013.441
A dataset for Movie Description, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ,
DOI : 10.1109/CVPR.2015.7298940
URL : http://arxiv.org/pdf/1501.02530
Translating Video Content to Natural Language Descriptions, 2013 IEEE International Conference on Computer Vision, 2013. ,
DOI : 10.1109/ICCV.2013.61
URL : http://ivan-titov.org/papers/iccv13.pdf
Unsupervised Semantic Parsing of Video Collections, 2015 IEEE International Conference on Computer Vision (ICCV), 2015. ,
DOI : 10.1109/ICCV.2015.509
URL : http://arxiv.org/pdf/1506.08438
Query-Focused Extractive Video Summarization, ECCV, 2016. ,
DOI : 10.1109/CVPR.2014.322
URL : http://arxiv.org/pdf/1607.05177
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, ECCV, 2016. ,
DOI : 10.1109/ICCV.2015.515
URL : https://hal.archives-ouvertes.fr/hal-01418216
Very deep convolutional networks for large-scale image recognition, 2015. ,
Grounded compositional semantics for finding and describing images with sentences, 2014. ,
Tvsum: Summarizing web videos using titles, CVPR, 2015. ,
MovieQA: Understanding Stories in Movies through Question-Answering, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.501
URL : http://arxiv.org/pdf/1512.02902
Towards surveillance video search by natural language query, Proceeding of the ACM International Conference on Image and Video Retrieval, CIVR '09, 2009. ,
DOI : 10.1145/1646396.1646442
URL : http://www.media.mit.edu/cogmac/publications/stefie10-civr2009.pdf
The new data and new challenges in multimedia research, 2015. ,
Using descriptive video services to create a large data source for video annotation research. arXiv preprint, 2015. ,
Learning languagevisual embedding for movie understanding with naturallanguage, 2016. ,
Sequence to Sequence -- Video to Text, 2015 IEEE International Conference on Computer Vision (ICCV), 2015. ,
DOI : 10.1109/ICCV.2015.515
URL : http://arxiv.org/pdf/1505.00487
Translating Videos to Natural Language Using Deep Recurrent Neural Networks, Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015. ,
DOI : 10.3115/v1/N15-1173
URL : http://arxiv.org/pdf/1412.4729
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, ECCV, 2016. ,
DOI : 10.1109/CVPR.2016.219
URL : http://arxiv.org/pdf/1608.00859
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.571
Jointly modeling deep video and compositional text to bridge vision and language in a unified framework, AAAI, 2015. ,
Unsupervised Extraction of Video Highlights via Robust Recurrent Auto-Encoders, 2015 IEEE International Conference on Computer Vision (ICCV), 2015. ,
DOI : 10.1109/ICCV.2015.526
URL : http://arxiv.org/pdf/1510.01442
Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. ,
DOI : 10.1109/CVPR.2016.112
Videoset: Video summary evaluation through text, CVPR Workshops, 2014. ,
Grounded language learning from video described with sentences, ACL, 2013. ,
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. ,
DOI : 10.1109/CVPR.2016.496
URL : http://arxiv.org/pdf/1510.07712