P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid et al., Finding Actors and Actions in Movies, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.283

URL : https://hal.archives-ouvertes.fr/hal-00904991

T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, High Accuracy Optical Flow Estimation Based on a Theory for Warping, ECCV, 2004.
DOI : 10.1007/978-3-540-24673-2_3

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.1732

M. Bucher, S. Herbin, and F. Jurie, Improving Semantic Embedding Consistency by Metric Learning for Zero-Shot Classiffication, ECCV, 2016.
DOI : 10.1007/s11263-013-0695-z

V. Escorcia, J. C. Niebles, and B. Ghanem, On the relationship between visual attributes and convolutional networks, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298730

M. Everingham, L. Van-gool, C. Williams, J. Winn, and A. Zisserman, The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, vol.73, issue.2, 2005.
DOI : 10.1371/journal.pcbi.0040027

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.167.6629

P. Felzenszwalb, R. Girshick, D. Mcallester, and D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, issue.9, 2010.
DOI : 10.1109/TPAMI.2009.167

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.81

URL : http://arxiv.org/abs/1311.2524

G. Gkioxari and J. Malik, Finding action tubes, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
DOI : 10.1109/CVPR.2015.7298676

M. Grundmann, V. Kwatra, M. Han, and I. Essa, Efficient hierarchical graph-based video segmentation, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.
DOI : 10.1109/CVPR.2010.5539893

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.294.4979

A. Gupta, A. Kembhavi, and L. S. Davis, Observing humanobject interactions: Using spatial and functional compatibility for recognition, IEEE Trans. on PAMI, issue.2, 2009.
DOI : 10.1109/tpami.2009.83

F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298698

URL : http://repository.kaust.edu.sa/kaust/bitstream/10754/556141/1/ActivityNet_CVPR2015.pdf

V. Kalogeiton, C. Schmid, and V. Ferrari, Analysing Domain Shift Factors between Videos and Images for Object Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.38, issue.11, 2016.
DOI : 10.1109/TPAMI.2016.2551239

URL : https://hal.archives-ouvertes.fr/hal-01281069

K. Kang, W. Ouyang, H. Li, and X. Wang, Object Detection from Video Tubelets with Convolutional Neural Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2002.
DOI : 10.1109/CVPR.2016.95

URL : http://arxiv.org/pdf/1604.04053

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-Scale Video Classification with Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.223

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.471.3312

A. Klaser, M. Marszalek, and C. Schmid, A Spatio-Temporal Descriptor Based on 3D-Gradients, Procedings of the British Machine Vision Conference 2008, 2008.
DOI : 10.5244/C.22.99

URL : https://hal.archives-ouvertes.fr/inria-00514853

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Communications of the ACM, vol.60, issue.6, 2004.
DOI : 10.1162/neco.2009.10-08-881

C. H. Lampert, H. Nickisch, and S. Harmeling, Attributebased classification for zero-shot visual object categorization, IEEE Trans. on PAMI, issue.3, 2014.
DOI : 10.1109/tpami.2013.140

T. Lan, Y. Zhu, A. Zamir, and S. Savarese, Action Recognition by Hierarchical Mid-Level Action Elements, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.517

URL : http://arxiv.org/abs/1508.07654

I. L. Lin, M. Maire, S. Belongie, J. Hays, P. Perona et al., On space-time interest points. IJCV Microsoft coco: Common objects in context, ECCV, 2005.

J. Liu, B. Kuipers, and S. Savarese, Recognizing human actions by attributes, CVPR 2011, 2006.
DOI : 10.1109/CVPR.2011.5995353

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.463.8447

A. C. Fu and . Berg, SSD: Single shot multibox detector, ECCV, 2016. 1

C. Lu, R. Krishna, M. Bernstein, and L. Fei-fei, Visual Relationship Detection with Language Priors, ECCV, 2008.
DOI : 10.1023/B:VISI.0000029664.99615.94

URL : http://arxiv.org/abs/1608.00187

S. Ma, J. Zhang, N. Ikizler-cinbis, and S. Sclaroff, Action Recognition and Localization by Hierarchical Space-Time Segments, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.341

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.663.1492

T. Malisiewicz, A. Gupta, and A. Efros, Ensemble of exemplar-SVMs for object detection and beyond, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126229

J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang et al., Deep captioning with multimodal recurrent neural networks (m-rnn), ICLR, 2015.

M. Pandey and S. Lazebnik, Scene recognition and weakly supervised object localization with deformable part-based models, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126383

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.300.7841

X. Peng and C. Schmid, Multi-region Two-Stream R-CNN for Action Detection, ECCV, 2005.
DOI : 10.1109/CVPR.2015.7298735

URL : https://hal.archives-ouvertes.fr/hal-01349107

P. O. Pinheiro, T. Lin, R. Collobert, and P. Dolì-ar, Learning to Refine Object Segments, ECCV, 2007.
DOI : 10.5244/C.30.15

URL : http://arxiv.org/abs/1603.08695

A. Prest, V. Ferrari, and C. Schmid, Explicit Modeling of Human-Object Interactions in Realistic Videos, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.4, 2013.
DOI : 10.1109/TPAMI.2012.175

URL : https://hal.archives-ouvertes.fr/inria-00626929

A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, Learning object class detectors from weakly annotated video, 2012 IEEE Conference on Computer Vision and Pattern Recognition, p.5
DOI : 10.1109/CVPR.2012.6248065

URL : https://hal.archives-ouvertes.fr/hal-00695940

M. Raptis, I. Kokkinos, and S. Soatto, Discovering discriminative action parts from mid-level video representations, 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
DOI : 10.1109/CVPR.2012.6247807

URL : https://hal.archives-ouvertes.fr/hal-00918807

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS, 2008.
DOI : 10.1109/TPAMI.2016.2577031

URL : http://arxiv.org/abs/1506.01497

M. D. Rodriguez, J. Ahmed, and M. Shah, Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587727

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.152.8729

M. A. Sadeghi and A. Farhadi, Recognition using visual phrases, CVPR 2011, 2008.
DOI : 10.1109/CVPR.2011.5995711

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.5551

S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin, Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos, Procedings of the British Machine Vision Conference 2016
DOI : 10.5244/C.30.58

C. Schüldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., 2004.
DOI : 10.1109/ICPR.2004.1334462

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, 2008.

K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2015.

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, CRCV-TR-12-01, 2012.

J. R. Uijlings, K. E. Van-de-sande, T. Gevers, and A. W. Smeulders, Selective Search for Object Recognition, International Journal of Computer Vision, vol.57, issue.1, 2013.
DOI : 10.1023/B:VISI.0000013087.49260.fb

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.361.3382

S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell et al., Sequence to Sequence -- Video to Text, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.515

URL : http://arxiv.org/abs/1505.00487

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and tell: A neural image caption generator, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298935

URL : http://arxiv.org/abs/1411.4555

P. Viola and M. Jones, Rapid object detection using a boosted cascade of simple features, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, 2001.
DOI : 10.1109/CVPR.2001.990517

H. Wang, D. Oneata, J. Verbeek, and C. Schmid, A Robust and Efficient Video Representation for Action Recognition, International Journal of Computer Vision, vol.103, issue.1, 2015.
DOI : 10.1109/ICCV.2013.442

URL : https://hal.archives-ouvertes.fr/hal-01145834

L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin et al., Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, ECCV, 2016.
DOI : 10.1109/CVPR.2016.219

URL : http://arxiv.org/abs/1608.00859

P. Weinzaepfel, Z. Harchaoui, and C. Schmid, Learning to Track for Spatio-Temporal Action Localization, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.362

URL : https://hal.archives-ouvertes.fr/hal-01159941

C. Xu and J. J. Corso, Actor-Action Semantic Segmentation with Grouping Process Models, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p.7
DOI : 10.1109/CVPR.2016.336

URL : http://arxiv.org/abs/1512.09041

C. Xu, S. Hsieh, C. Xiong, and J. J. Corso, Can humans fly? Action understanding with multiple classes of actors, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
DOI : 10.1109/CVPR.2015.7298839

B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas et al., Human action recognition by learning bases of action attributes and parts, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126386

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.227.6992

L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal et al., Describing Videos by Exploiting Temporal Structure, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.512

URL : http://arxiv.org/abs/1502.08029