J. Aggarwal and M. Ryoo, Human activity analysis, ACM Computing Surveys, vol.43, issue.3, 2011.
DOI : 10.1145/1922649.1922653

T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, High Accuracy Optical Flow Estimation Based on a Theory for Warping, ECCV, 2004.
DOI : 10.1007/978-3-540-24673-2_3

V. Delaitre, J. Sivic, and I. Laptev, Learning person-object interactions for action recognition in still images, NIPS, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00648156

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, CVPR, 2015.

P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part-Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.32, issue.9, 2010.
DOI : 10.1109/TPAMI.2009.167

A. Gaidon, Z. Harchaoui, and C. Schmid, Temporal Localization of Actions with Actoms, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.11, 2013.
DOI : 10.1109/TPAMI.2013.65
URL : https://hal.archives-ouvertes.fr/hal-00687312

J. Gall, N. Razavi, and L. Van Gool, On-line Adaption of Class-specific Codebooks for Instance Tracking, Proceedings of the British Machine Vision Conference 2010, 2010.
DOI : 10.5244/C.24.55

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.81

A. Giusti, D. C. Ciresan, J. Masci, L. M. Gambardella, and J. Schmidhuber, Fast image scanning with deep max-pooling convolutional neural networks, 2013 IEEE International Conference on Image Processing, 2013.
DOI : 10.1109/ICIP.2013.6738831

G. Gkioxari and J. Malik, Finding action tubes, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7298676

S. Hare, A. Saffari, and P. Torr, Struck: Structured output tracking with kernels, ICCV, 2011.

J. Hosang, R. Benenson, P. Dollár, and B. Schiele, What makes for effective detection proposals? arXiv, 2015.

Y. Hua, K. Alahari, and C. Schmid, Occlusion and Motion Reasoning for Long-Term Tracking, ECCV, 2014.
DOI : 10.1007/978-3-319-10599-4_12
URL : https://hal.archives-ouvertes.fr/hal-01020149

M. Jain, J. C. van Gemert, H. Jégou, P. Bouthemy, and C. G. Snoek, Action Localization with Tubelets from Motion, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.100
URL : https://hal.archives-ouvertes.fr/hal-00996844

H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, Towards Understanding Action Recognition, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.396
URL : https://hal.archives-ouvertes.fr/hal-00906902

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long et al., Caffe: Convolutional Architecture for Fast Feature Embedding, Proceedings of the ACM International Conference on Multimedia, MM '14, 2014.
DOI : 10.1145/2647868.2654889

Y. Jiang, J. Liu, A. Zamir, G. Toderici, I. Laptev et al., THUMOS challenge: Action recognition with a large number of classes, 2014.

Z. Kalal, K. Mikolajczyk, and J. Matas, Tracking-learning-detection, IEEE Trans. PAMI, issue.2, 2012.

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-Scale Video Classification with Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.223

A. Kläser, M. Marszałek, C. Schmid, and A. Zisserman, Human Focused Action Localization in Video, International Workshop on Sign, Gesture, and Activity (SGA), 2010.
DOI : 10.1007/978-3-642-35749-7_17

A. Kläser, M. Marszałek, and C. Schmid, A Spatio-Temporal Descriptor Based on 3D-Gradients, Proceedings of the British Machine Vision Conference 2008, 2008.
DOI : 10.5244/C.22.99

A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, NIPS, 2012.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126543

T. Lan, Y. Wang, and G. Mori, Discriminative figure-centric models for joint action localization and recognition, ICCV, 2011.

I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587756
URL : https://hal.archives-ouvertes.fr/inria-00548659

I. Laptev and P. Pérez, Retrieving actions in movies, 2007 IEEE 11th International Conference on Computer Vision, 2007.
DOI : 10.1109/ICCV.2007.4409105

S. Ma, J. Zhang, N. Ikizler-Cinbis, and S. Sclaroff, Action Recognition and Localization by Hierarchical Space-Time Segments, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.341

D. Oneata, J. Verbeek, and C. Schmid, Efficient Action Localization with Approximately Normalized Fisher Vectors, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.326
URL : https://hal.archives-ouvertes.fr/hal-00979594

D. Oneata, J. Verbeek, and C. Schmid, The LEAR submission at Thumos, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01074442

R. Poppe, A survey on vision-based human action recognition, Image and Vision Computing, vol.28, issue.6, 2010.
DOI : 10.1016/j.imavis.2009.11.014

A. Prest, V. Ferrari, and C. Schmid, Explicit Modeling of Human-Object Interactions in Realistic Videos, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.35, issue.4, 2012.
DOI : 10.1109/TPAMI.2012.175
URL : https://hal.archives-ouvertes.fr/hal-00720847

M. D. Rodriguez, J. Ahmed, and M. Shah, Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587727

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus et al., Overfeat: Integrated recognition, localization and detection using CNN, ICLR, 2014.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, 2014.

K. Soomro, A. R. Zamir, and M. Shah, UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, 2012.

Y. Tian, R. Sukthankar, and M. Shah, Spatiotemporal Deformable Part Models for Action Detection, 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
DOI : 10.1109/CVPR.2013.341

D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri, C3D: generic features for video analysis. arXiv, 2014.

D. Tran and J. Yuan, Max-margin structured output regression for spatio-temporal action localization, NIPS, 2012.

J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, Selective Search for Object Recognition, International Journal of Computer Vision, vol.104, issue.2, 2013.
DOI : 10.1007/s11263-013-0620-5

H. Wang, A. Kläser, C. Schmid, and C. Liu, Dense Trajectories and Motion Boundary Descriptors for Action Recognition, International Journal of Computer Vision, vol.103, issue.1, 2013.
DOI : 10.1007/s11263-012-0594-8
URL : https://hal.archives-ouvertes.fr/hal-00725627

H. Wang, D. Oneata, J. Verbeek, and C. Schmid, A Robust and Efficient Video Representation for Action Recognition, International Journal of Computer Vision, vol.103, issue.1, p.7, 2015.
DOI : 10.1007/s11263-015-0846-5
URL : https://hal.archives-ouvertes.fr/hal-01145834

L. Wang, Y. Qiao, and X. Tang, Video Action Detection with Relational Dynamic-Poselets, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_37

G. Yu and J. Yuan, Fast action proposals for human action detection and search, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI : 10.1109/CVPR.2015.7298735

C. L. Zitnick and P. Dollár, Edge Boxes: Locating Object Proposals from Edges, ECCV, 2014.
DOI : 10.1007/978-3-319-10602-1_26
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.453.5208