S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici et al., YouTube-8M: A large-scale video classification benchmark, 2016.

D. Arijon, Grammar of the film language, 1991.

R. Barker and H. Wright, Midwest and its children: The psychological ecology of an American town. Row, Peterson and Company, 1954.

M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions as space-time shapes, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, 2005.
DOI : 10.1109/ICCV.2005.28

URL : http://www.wisdom.weizmann.ac.il/~yelenag/spaceTimeActionsTPAMI2007.pdf

F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, ActivityNet: A large-scale video benchmark for human activity understanding, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298698

J. Carreira and A. Zisserman, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.502

Y. Chao, Z. Wang, Y. He, J. Wang, and J. Deng, HICO: A Benchmark for Recognizing Human-Object Interactions in Images, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.122

K. Church and P. Hanks, Word association norms, mutual information, and lexicography, Computational Linguistics, vol. 16, no. 1, 1990.

G. Gkioxari and J. Malik, Finding action tubes, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298676

R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal et al., The "Something Something" Video Database for Learning and Evaluating Visual Common Sense, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.622

S. Gupta and J. Malik, Visual semantic role labeling, CoRR, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.90

G. Van Horn and P. Perona, The devil is in the tails: Fine-grained classification in the wild, 2017.

R. Hou, C. Chen, and M. Shah, Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.620

J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara et al., Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
DOI : 10.1109/CVPR.2017.351

H. Idrees, A. R. Zamir, Y. Jiang, A. Gorban, I. Laptev et al., The THUMOS challenge on action recognition for videos "in the wild", 2017.

E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, FlowNet 2.0: Evolution of optical flow estimation with deep networks, CVPR, 2017.

H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. Black, Towards Understanding Action Recognition, 2013 IEEE International Conference on Computer Vision (ICCV), 2013.
DOI : 10.1109/ICCV.2013.396

URL : https://hal.archives-ouvertes.fr/hal-00906902

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-Scale Video Classification with Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.223

W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier et al., The Kinetics human action video dataset, 2017.

Y. Ke, R. Sukthankar, and M. Hebert, Efficient visual event detection using volumetric features, ICCV, 2005.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126543

H. W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly, vol. 3, no. 1-2, pp. 83-97, 1955.

M. Marszalek, I. Laptev, and C. Schmid, Actions in context, 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
DOI : 10.1109/CVPR.2009.5206557

URL : https://hal.archives-ouvertes.fr/inria-00548645

P. Mettes, J. van Gemert, and C. Snoek, Spot On: Action Localization from Pointly-Supervised Proposals, ECCV, 2016.

P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders et al., TRECVID 2014 – an overview of the goals, tasks, data, evaluation mechanisms and metrics, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01230444

X. Peng and C. Schmid, Multi-region Two-Stream R-CNN for Action Detection, ECCV, 2016.

URL : https://hal.archives-ouvertes.fr/hal-01349107

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS, 2015.
DOI : 10.1109/TPAMI.2016.2577031

M. Rodriguez, J. Ahmed, and M. Shah, Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587727

S. Saha, G. Singh, and F. Cuzzolin, AMTnet: Action-Micro-Tube Regression by End-to-end Trainable Deep Architecture, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.473

S. Saha, G. Singh, M. Sapienza, P. Torr, and F. Cuzzolin, Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos, Proceedings of the British Machine Vision Conference (BMVC), 2016.
DOI : 10.5244/C.30.58

C. Schuldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of the 17th International Conference on Pattern Recognition (ICPR), 2004.
DOI : 10.1109/ICPR.2004.1334462

G. Sigurdsson, O. Russakovsky, A. Farhadi, I. Laptev, and A. Gupta, Much ado about time: Exhaustive annotation of temporal data, Conference on Human Computation and Crowdsourcing, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01431527

G. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev et al., Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, ECCV, 2016.

URL : https://hal.archives-ouvertes.fr/hal-01418216

G. Singh, S. Saha, M. Sapienza, P. Torr, and F. Cuzzolin, Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.393

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, 2012.

C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the Inception Architecture for Computer Vision, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.308

V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, Action Tubelet Detector for Spatio-Temporal Action Localization, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.472

URL : https://hal.archives-ouvertes.fr/hal-01519812

L. Wang, Y. Qiao, X. Tang, and L. Van Gool, Actionness Estimation Using Hybrid Fully Convolutional Networks, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.296

P. Weinzaepfel, Z. Harchaoui, and C. Schmid, Learning to Track for Spatio-Temporal Action Localization, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.362

URL : https://hal.archives-ouvertes.fr/hal-01159941

P. Weinzaepfel, X. Martin, and C. Schmid, Towards weakly-supervised action localization, 2016.

L. Wu, C. Shen, and A. van den Hengel, PersonNet: Person re-identification with deep convolutional neural networks, 2016.

S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori et al., Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos, International Journal of Computer Vision, 2017.

J. Yuan, Z. Liu, and Y. Wu, Discriminative subvolume search for efficient action detection, CVPR, 2009.

H. Zhao, Z. Yan, H. Wang, L. Torresani, and A. Torralba, SLAC: A sparsely labeled dataset for action classification and localization. arXiv preprint, 2017.

M. Zolfaghari, G. Oliveira, N. Sedaghat, and T. Brox, Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection, 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
DOI : 10.1109/ICCV.2017.316