B. Tversky, J. Morrison, and J. Zacks, On bodies and events, The Imitative Mind, 2002.
DOI : 10.1017/CBO9780511489969.013

I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
DOI : 10.1109/CVPR.2008.4587756
URL : https://hal.archives-ouvertes.fr/inria-00548659

J. C. Niebles, H. Wang, and L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, International Journal of Computer Vision, pp.299-318, 2008.

C. Schüldt, I. Laptev, and B. Caputo, Recognizing human actions: a local SVM approach, ICPR, 2004.

H. Wang and C. Schmid, Action Recognition with Improved Trajectories, 2013 IEEE International Conference on Computer Vision, 2013.
DOI : 10.1109/ICCV.2013.441
URL : https://hal.archives-ouvertes.fr/hal-00873267

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, NIPS, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS, 2012.
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.299.205

J. Deng, W. Dong, R. Socher, L. Li, K. Li et al., ImageNet: A large-scale hierarchical image database, CVPR, 2009.

B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, Learning deep features for scene recognition using places database, NIPS, 2014.

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.81
URL : http://arxiv.org/abs/1311.2524

Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.220

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-Scale Video Classification with Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.223
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.471.3312

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
DOI : 10.1109/ICCV.2015.510
URL : http://arxiv.org/abs/1412.0767

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, CVPR, 2015.
DOI : 10.1109/tpami.2016.2599174
URL : http://arxiv.org/abs/1411.4389

S. Ji, W. Xu, M. Yang, and K. Yu, 3D Convolutional Neural Networks for Human Action Recognition, ICML, 2010.
DOI : 10.1109/TPAMI.2012.59
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.169.4046

G. W. Taylor, R. Fergus, Y. Lecun, and C. Bregler, Convolutional Learning of Spatio-temporal Features, ECCV, 2010.
DOI : 10.1007/978-3-642-15567-3_11
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.178.9267

G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, Visual categorization with bags of keypoints, ECCVW, 2004.

F. Perronnin, J. Sánchez, and T. Mensink, Improving the Fisher Kernel for Large-Scale Image Classification, ECCV, 2010.
DOI : 10.1007/978-3-642-15561-1_11
URL : https://hal.archives-ouvertes.fr/inria-00548630

B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars, Modeling video evolution for action recognition, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
DOI : 10.1109/CVPR.2015.7299176

Y. Lecun, B. Boser, J. S. Denker, D. Henderson, R. Howard et al., Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, vol.1, issue.4, pp.541-551, 1989.
DOI : 10.1162/neco.1989.1.4.541

L. Wang, Y. Qiao, and X. Tang, Action recognition with trajectory-pooled deep-convolutional descriptors, CVPR, 2015.
DOI : 10.1109/cvpr.2015.7299059
URL : http://arxiv.org/abs/1505.04868

L. Wang, Y. Xiong, Z. Wang, and Y. Qiao, Towards good practices for very deep two-stream convnets, 2015.
URL : http://arxiv.org/abs/1507.02159

H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, Dynamic Image Networks for Action Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.331

C. Feichtenhofer, A. Pinz, and A. Zisserman, Convolutional two-stream network fusion for video action recognition, CVPR, 2016.
DOI : 10.1109/cvpr.2016.213
URL : http://arxiv.org/abs/1604.06573

B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang, Real-Time Action Recognition with Enhanced Motion Vector CNNs, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
DOI : 10.1109/CVPR.2016.297
URL : http://arxiv.org/abs/1604.07669

V. Kantorov and I. Laptev, Efficient Feature Extraction, Encoding, and Classification for Action Recognition, 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
DOI : 10.1109/CVPR.2014.332
URL : https://hal.archives-ouvertes.fr/hal-01058734

G. Farnebäck, Two-Frame Motion Estimation Based on Polynomial Expansion, SCIA, 2003.
DOI : 10.1007/3-540-45103-X_50

T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, High Accuracy Optical Flow Estimation Based on a Theory for Warping, ECCV, 2004.
DOI : 10.1007/978-3-540-24673-2_3
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.1732

K. Soomro, A. Zamir, and M. Shah, UCF101: A dataset of 101 human actions classes from videos in the wild, 2012.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, HMDB: A large video database for human motion recognition, 2011 International Conference on Computer Vision, 2011.
DOI : 10.1109/ICCV.2011.6126543
URL : http://cbcl.mit.edu/publications/ps/Kuehne_etal_iccv11.pdf

Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj, Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition, CVPR, 2015.

J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga et al., Beyond short snippets: Deep networks for video classification, CVPR, 2015.

M. D. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV, 2014.
DOI : 10.1007/978-3-319-10590-1_53
URL : http://arxiv.org/abs/1311.2901